Abstract


Most learning to rank research has assumed that the utility of different documents is independent, 
which results in learned ranking functions that return redundant results. The few approaches that 
avoid this have rather unsatisfyingly lacked theoretical foundations, or do not scale. We present 
a learning-to-rank formulation that optimizes the fraction of satisfied users, with several scalable 
algorithms that explicitly take document similarity and ranking context into account. Our formulation
is a non-trivial common generalization of two multi-armed bandit models from the literature:
ranked bandits (Radlinski et al., 2008) and Lipschitz bandits (Kleinberg et al., 2008b). We present 
theoretical justifications for this approach, as well as a near-optimal algorithm. Our evaluation adds 
optimizations that improve empirical performance, and shows that our algorithms learn orders of 
magnitude more quickly than previous approaches. 

Keywords: online learning, clickthrough data, diversity, multi-armed bandits, contextual bandits, 
regret, metric spaces 


1. Introduction 


Identifying the most relevant results to a query is a central problem in web search, hence learning 
ranking functions has received a lot of attention (e.g., Joachims, 2002; Burges et al., 2005; Chu 
and Ghahramani, 2005; Taylor et al., 2008). One increasingly important goal is to learn from user 
interactions with search engines, such as clicks. We address the task of learning a ranking function 
that minimizes the likelihood of query abandonment: the event that the user does not click on any of 
the search results for a given query. This objective is particularly interesting as query abandonment 
is a major challenge in today’s search engines, and is also sensitive to the diversity and redundancy
among documents presented. 

We consider the Multi-Armed Bandit (MAB) setting (e.g., Cesa-Bianchi and Lugosi, 2006), 
which captures many online learning problems wherein an algorithm chooses sequentially among a 
fixed set of alternatives, traditionally called “arms”. In each round an algorithm chooses an arm and 
collects the corresponding reward. Crucially, the algorithm receives limited feedback—only for the 
arm it has chosen, which gives rise to the tradeoff between exploration (acquiring new information) 
and exploitation (taking advantage of the information available so far). 

While most of the literature on MAB corresponds to learning a single best alternative, MAB 
algorithms can also be extended to learning a ranking of documents that minimizes query abandon- 
ment (Radlinski et al., 2008; Streeter and Golovin, 2008). In this setting, called Ranked Bandits, in 
each round an algorithm chooses an ordered list of k documents from some fixed collection of doc- 
uments, and receives clicks on some of the chosen documents. Crucially, the click probability for 
a given document may depend on the documents shown above: a user scrolls the list top-down and 
may leave as soon as she clicked on the first document. The goal is to minimize query abandonment. 

Radlinski et al. (2008) and Streeter and Golovin (2008) propose a simple but effective approach: 
for each position in the ranking there is a separate instance of a bandit algorithm which is responsible
for choosing a document for this position. However, the specific algorithms they considered are 
impractical at WWW scales. 

Prior work on MAB algorithms has considered exploiting structure in the space of arms to 
improve convergence rates. One particular approach, articulated by Kleinberg et al. (2008b), is
well suited to our scenario: when the arms form a metric space and the payoff function satisfies a 
Lipschitz condition with respect to this metric space. The metric space provides information about 
similarity between arms, which allows the algorithm to make inferences about similar arms without 
exploring them. Further, they propose a “zooming algorithm” which partitions the metric space into 
regions (and treats each region as a “meta-arm”) so that the partition is adaptively refined over time
and becomes finer in regions with higher payoffs. 

In web search, a metric space directly models similarity between documents. (It is worth noting 
that most offline learning-to-rank approaches also rely on similarity between documents, at least 
implicitly.) 

Our contributions. This paper initiates the study of bandit learning-to-rank with side informa- 
tion on similarity between documents. We adopt the Ranked bandits setup: a user scrolls the results 
top-down and may leave after a single click; the goal is to minimize query abandonment. The
similarity information is expressed as a metric space. 

In this paper we consider a “perfect world” scenario: there exists an informative distance function
which meaningfully describes similarity between documents in a ranked setting, and an algorithm
has access to such a function. We focus on two high-level questions: how to represent the
knowledge of document similarity, and how to use it algorithmically in a bandit setting. We believe
that studying such a “perfect world” scenario is useful, and perhaps necessary, to inform and guide
the corresponding data-driven work.

We propose a simple bandit model which combines Ranked bandits (Radlinski et al., 2008) 
and Lipschitz bandits (Kleinberg et al., 2008b), and admits efficient bandit algorithms that, unlike 
those in prior work on bandit learning-to-rank, scale to large document collections. Our model is 
based on the new notion of “conditional Lipschitz continuity” which asserts that similar documents 
have similar click probabilities even conditional on the event that all documents in a given set 
of documents are skipped (i.e., not clicked on) by the current user. We study this model both
theoretically and empirically. 

First, we validate the expressiveness of our model by providing an explicit construction for 
a wide family of plausible user distributions which provably fit the model. The analysis of this 
construction is perhaps the most technical contribution of this paper. We also use this construction 
in simulations. 

Second, we put forth a battery of algorithms for our model. Some of these algorithms are 
straightforward combinations of ideas from prior work on Ranked bandits and Lipschitz bandits, 
and some are new. 

A crucial insight in the new algorithms is that for each position i in the ranking there is a context 
that we can use, namely the set of documents chosen for the above positions in the same round. 
Indeed, since our objective is non-abandonment we only care about position i if all documents 
shown above i have been skipped in the present round. So the algorithm responsible for position i 
can simply assume that these documents have been skipped. 

This interpretation of contexts allows us to cast the position-i problem as a contextual bandit 
problem. Moreover, we derive a Lipschitz condition on contexts (with respect to a suitably defined 
metric), which allows us to use the contextual Lipschitz MAB machinery from Slivkins (2009). We 
also exploit correlations between clicks: if a given document is included in the context—that is, if 
this document is skipped by the current user—then similar documents are likely to be skipped, too. 
More specifically, we propose two algorithms that use contexts: a “heavy-weight” algorithm which 
uses both the metric on contexts and correlated clicks, and a “light-weight” algorithm which uses 
correlated clicks but not the metric on contexts. 

Third, we provide scalability guarantees for the heavy-weight contextual algorithm, proving that 
the convergence rate depends only on the dimensionality of the metric space but not on the number 
of documents. However, we argue that our provable guarantees do not fully reflect the power of the 
algorithm, and outline some directions for the follow-up theoretical work. In particular, we identify 
a stronger benchmark and discuss convergence to this benchmark. We provide an initial result: we 
prove, without any guarantees on the convergence rate, that the heavy-weight contextual algorithm 
indeed converges to this stronger benchmark. This theoretical discussion is one of the contributions.

Finally, we empirically study the performance of our algorithms. We run a large-scale simula- 
tion using the above-mentioned construction with realistic parameters. The main goal is to compare 
the convergence rates of the various approaches. In particular, we confirm that metric-aware algo- 
rithms significantly outperform the metric-oblivious ones, and that taking the context into account 
improves the convergence rate. Somewhat surprisingly, our light-weight contextual algorithm per- 
forms better than the heavy-weight one. 

A secondary, smaller-scale experiment studies the limit behaviour of the algorithms, that is, the 
query abandonment probability that the algorithms converge to. Following the theoretical discus- 
sion mentioned above, we design a principled example on which different algorithms exhibit very 
different limit behaviour. Interestingly, the heavy-weight contextual algorithm is the only algorithm 
that achieves the optimal limit behaviour in this experiment. 

Map of the paper. We start with a brief survey of related work (Section 2). We define our 
model in Section 3, and validate its expressiveness in Section 4. In-depth discussion of relevant 
approaches from prior work is in Section 5. Our new approach, ranked contextual bandits in metric 
spaces, is presented in Section 6. Scalability guarantees are discussed in Section 7. We present our 
simulations in Section 8. 
To keep the flow of the paper, the lengthy proofs for the theoretical results in Section 4 are
presented in Section A and Section B. Moreover, the background on instance-dependent regret 
bounds for UCB1-style algorithms is discussed in Appendix C. 


2. Related Work on Multi-Armed Bandits 


Multi-armed bandits have been studied for many decades as a simple yet expressive model for under- 
standing exploration-exploitation tradeoffs. A thorough discussion of the literature on bandit prob- 
lems is beyond the scope of this paper. For background, a reader can refer to a book (Cesa-Bianchi 
and Lugosi, 2006) and a recent survey (Bubeck and Cesa-Bianchi, 2012) on regret-minimizing 
bandits.1 A somewhat different, Bayesian perspective can be found in surveys (Sundaram, 2005;
Bergemann and Välimäki, 2006). 

On a very high level, there is a crucial distinction between regret-minimizing formulations and 
Bayesian/MDP formulations (see the surveys mentioned above); this paper follows the former. 
Among regret-minimizing formulations, an important distinction is between stochastic rewards (Lai 
and Robbins, 1985; Auer et al., 2002a) and adversarial rewards (Auer et al., 2002b). 

Below we survey several directions that are directly relevant to this paper. 

Ranked bandits. A bandit model in which an algorithm learns a ranking of documents with a 
goal to minimize query abandonment has been introduced in Radlinski et al. (2008) under the name 
ranked bandits. A crucial feature in this setting is that the click probability for a given document 
may depend not only on the document and the position in which it is shown, but also the documents 
shown above. In particular, documents shown above can “steal” clicks from the documents shown 
below, in the sense that a user scrolls the list top-down and may leave as soon as she clicked on the 
first document. 

Independently, Streeter and Golovin (2008) considered a more general model where the goal 
is to minimize an arbitrary (known) submodular set function, rather than query abandonment. A 
further generalization to submodular functions on ordered assignments (rather than on sets) was
considered in Golovin et al. (2009). The contributions of the three papers essentially coincide for
the special case of ranked bandits. 

Uchiya et al. (2010)2 and Kale et al. (2010)2 considered a related bandit model in which an
algorithm selects a ranking of documents in each round, but the click probabilities for a given 
document do not depend on which other documents are shown to the same user. 

Bandits with structure. Numerous papers enriched the basic MAB setting by assuming some 
structure on arms, typically in order to handle settings where the number of arms is very large or 
infinite. Most relevant to this paper is the model where arms lie in a metric space and their expected 
rewards satisfy the Lipschitz condition with respect to this metric space (see Section 3 for details). 
This model, for a general metric space, has been introduced in Kleinberg et al. (2008b) under the 
name Lipschitz MAB; the special case of unit interval has been studied in (Agrawal, 1995; Kleinberg, 
2004; Auer et al., 2007) under the name continuum-armed bandits. Subsequent work on Lipschitz
MAB includes Bubeck et al. (2011), Kleinberg and Slivkins (2010), Maillard and Munos (2010),
Slivkins (2009) and Slivkins (2011). A closely related model posits that arms correspond to leaves
on a tree, but no metric space is revealed to the algorithm (Kocsis and Szepesvari, 2006; Pandey
et al., 2007; Munos and Coquelin, 2007; Slivkins, 2011).

1. Regret of an algorithm in T rounds, typically denoted R(T), is the expected payoff of the benchmark in T rounds
minus that of the algorithm. A standard benchmark is the best arm in hindsight.
2. This is either concurrent or subsequent work with respect to the conference publication of this paper.

Another commonly assumed structure is linear or convex payoffs (e.g., Awerbuch and Klein- 
berg, 2008; Flaxman et al., 2005; Dani et al., 2007; Abernethy et al., 2008; Hazan and Kale, 2009). 
Linear/convex payoffs is a much stronger assumption than similarity, essentially because it allows
one to make strong inferences about far-away arms. Other structural assumptions have been considered,
for example, by Wang et al. (2008), Bubeck and Munos (2010) and Srinivas et al. (2010).

The distinction between the various possible structural assumptions is orthogonal to the dis- 
tinction between stochastic and adversarial rewards. With a few exceptions, papers on MAB with 
linear/convex payoffs allow adversarial payoffs, whereas papers on MAB with similarity information
focus on stochastic payoffs.

Contextual bandits. Here in each round the algorithm receives a context, chooses an arm, and 
the reward depends both on the arm and the context. The term “contextual bandits” was coined 
in Langford and Zhang (2007). The setting, with a number of different modifications, has been 
introduced independently in several papers; a possibly incomplete list is Woodroofe (1979), Auer 
et al. (2002b), Auer (2002), Wang et al. (2005), Langford and Zhang (2007), Hazan and Megiddo 
(2007) and Pandey et al. (2007). 

There are several models for how contexts are related to rewards: rewards are linear in the
context (e.g., Auer, 2002; Langford and Zhang, 2007; Chu et al., 2011); the context is a
random variable correlated with rewards (Woodroofe, 1979; Wang et al., 2005; Rigollet and Zeevi,
2010); rewards are Lipschitz with respect to a metric space on contexts (Hazan and Megiddo, 2007;
Slivkins, 2009) and Lu et al. (2010).2

Most work on contextual bandits has been theoretical in nature; experimental work on contextual 
MAB includes Pandey et al. (2007) and Li et al. (2010, 2011). 


3. Problem Formalization: Ranked Bandits in Metric Spaces 


Let us introduce the online learning-to-rank problem that we study in this paper. 

Ranked bandits. Following Radlinski et al. (2008), we are interested in learning an optimally 
diverse ranking of documents for a given query. We model it as a ranked bandit problem as follows. 
Let $X$ be a set of documents (“arms”). Each “user” is represented by a binary relevance vector: a
function $\pi : X \to \{0,1\}$. A document $x \in X$ is called “relevant” to the user if and only if $\pi(x) = 1$.
Let $\mathcal{F}_X$ be the set of all possible relevance vectors. Users come from a distribution $\mathcal{P}$ on $\mathcal{F}_X$ that is
fixed but not revealed to an algorithm.3 This $\mathcal{P}$ will henceforth be called the user distribution.

In each round, the following happens: a user arrives, sampled independently from $\mathcal{P}$; an algorithm
outputs a list of $k$ documents; the user scans this list top-down, and clicks on the first relevant
document. The goal is to maximize the expected fraction of satisfied users: users who click on at 
least one document. Note that in contrast with prior work on diversifying existing rankings (e.g., 
Carbonell and Goldstein, 1998), the algorithm needs to directly learn a diverse ranking. 

Since we count satisfied users rather than the clicks themselves, we can assume w.l.o.g. that 
a user leaves once she clicks once. (Alternatively, the algorithm does not record any subsequent 
clicks.) A user is satisfied or not satisfied independently of the order in which she scans the results. 
However, the assumption of the top-down scan determines the feedback received by the algorithm, 
that is, which document gets clicked. 
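The round protocol above can be illustrated with a minimal sketch. The function names (`run_round`, `satisfied_fraction`) and the toy user population are illustrative assumptions, not part of the model; relevance vectors are represented as dictionaries from document names to 0/1 values.

```python
from typing import Dict, List, Optional

def run_round(ranking: List[str], relevance: Dict[str, int]) -> Optional[int]:
    """Return the 0-based slot of the clicked document, or None on abandonment.

    The user scans the list top-down and clicks on the first relevant document.
    """
    for slot, doc in enumerate(ranking):
        if relevance.get(doc, 0) == 1:
            return slot
    return None

def satisfied_fraction(ranking: List[str], users: List[Dict[str, int]]) -> float:
    """Fraction of users who click on at least one document in the ranking."""
    clicks = [run_round(ranking, user) for user in users]
    return sum(click is not None for click in clicks) / len(users)

# Hypothetical population: documents "a" and "b" satisfy the same users,
# while "c" satisfies a different user.
users = [
    {"a": 1, "b": 1, "c": 0},
    {"a": 1, "b": 1, "c": 0},
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 0, "c": 0},
]
redundant = satisfied_fraction(["a", "b"], users)  # second slot adds nothing
diverse = satisfied_fraction(["a", "c"], users)    # second slot covers a new user
```

In this toy population the redundant list satisfies half of the users and the diverse list satisfies three quarters, which is the effect the objective is meant to capture.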





3. This also models users for whom documents are probabilistically relevant (Radlinski et al., 2008).

We will say that there are $k$ slots to be filled in each round, so that when the algorithm outputs
the list of $k$ documents, the $i$-th document in this list appears in slot $i$. Note that the standard model
of MAB with stochastic rewards (e.g., Auer et al., 2002a) is a special case with a single slot ($k = 1$).

Click probabilities. Recall that $\mathcal{P}$ is a distribution over relevance vectors. The pointwise mean
of $\mathcal{P}$ is a function $\mu : X \to [0,1]$ such that $\mu(x) = \mathbb{E}_{\pi \sim \mathcal{P}}[\pi(x)]$. Thus, $\mu(x)$ is the click probability for
document $x$ if it appears in the top slot.

Each slot $i > 1$ is examined by the user only in the event that all documents in the higher slots are
not clicked, so the relevant click probabilities for this slot are conditional on this event. Formally,
fix a subset of documents $S \subset X$ and let $Z_S = \{\pi(\cdot) = 0 \text{ on } S\}$ be the event that all documents in $S$
are not relevant to the user. Let $(\mathcal{P}|Z_S)$ be the distribution of users obtained by conditioning $\mathcal{P}$ on
this event, and let $\mu(\cdot\,|Z_S)$ be its pointwise mean. Then $\mu(x|Z_S)$ is the click probability for document
$x$ if $S$ is the set of documents shown above $x$ in the same round.
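As a small illustration of the conditional click probability, the sketch below computes the pointwise mean conditional on a set of skipped documents for an explicitly listed user distribution. The representation (a list of probability/relevance-vector pairs) and the name `conditional_mean` are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# A finite user distribution: (probability, relevance vector) pairs.
UserDist = List[Tuple[float, Dict[str, int]]]

def conditional_mean(dist: UserDist, x: str, skipped: Iterable[str]) -> float:
    """Click probability of x given that every document in `skipped` is irrelevant."""
    skipped = list(skipped)
    mass = 0.0  # probability of the conditioning event
    hit = 0.0   # probability that the event holds and x is relevant
    for prob, pi in dist:
        if all(pi[s] == 0 for s in skipped):
            mass += prob
            hit += prob * pi[x]
    if mass == 0.0:
        raise ValueError("conditioning event has probability zero")
    return hit / mass

# Hypothetical distribution: "a" and "b" are relevant to the same users.
dist = [
    (0.50, {"a": 1, "b": 1, "c": 0}),
    (0.25, {"a": 0, "b": 0, "c": 1}),
    (0.25, {"a": 0, "b": 0, "c": 0}),
]
mu_b = conditional_mean(dist, "b", [])               # unconditional mean of b
mu_b_given_a = conditional_mean(dist, "b", ["a"])    # b after a was skipped
mu_c_given_a = conditional_mean(dist, "c", ["a"])    # c after a was skipped
```

Here document "b" has a high unconditional click probability but is worthless below "a", whereas "c" becomes more attractive once "a" has been skipped.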

Metric spaces. Throughout the paper, let $(X,D)$ be a metric space. That is, $X$ is a set and $D$ is
a symmetric function $X \times X \to [0,\infty]$ such that $D(x,y) = 0 \iff x = y$, and
$D(x,y) + D(y,z) \ge D(x,z)$ (triangle inequality).

A function $\nu : X \to \mathbb{R}$ is said to be Lipschitz-continuous with respect to $(X,D)$ if

$$|\nu(x) - \nu(y)| \le D(x,y) \quad \text{for all } x,y \in X. \tag{1}$$


Throughout the paper, we will write L-continuous for brevity. 

A user distribution $\mathcal{P}$ is called L-continuous with respect to $(X,D)$ if its pointwise mean $\mu$ is
L-continuous with respect to $(X,D)$.
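For a finite set of points, the Lipschitz condition (1) can be checked directly. The following is a minimal sketch; the helper name `is_l_continuous`, the numerical tolerance, and the toy points on the unit interval are assumptions for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, Hashable

def is_l_continuous(
    values: Dict[Hashable, float],
    dist: Callable[[Hashable, Hashable], float],
    tol: float = 1e-12,
) -> bool:
    """Check |values[x] - values[y]| <= dist(x, y) for all pairs of points."""
    return all(
        abs(values[x] - values[y]) <= dist(x, y) + tol
        for x, y in combinations(values, 2)
    )

# Hypothetical example: three points on the unit interval with their means.
means = {0.0: 0.1, 0.5: 0.3, 1.0: 0.5}
ok_plain = is_l_continuous(means, lambda x, y: abs(x - y))
ok_scaled = is_l_continuous(means, lambda x, y: 0.3 * abs(x - y))
```

The same means pass the check under the plain distance but fail under the shrunken one, which shows that L-continuity is a joint property of the function and the metric.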

Document similarity. To allow us to incorporate information about similarity between docu- 
ments, we start with the model, called Lipschitz MAB, proposed by Kleinberg et al. (2008b) for 
the standard (single-slot) bandits. In this model, an algorithm is given a metric space (X,D) with 
respect to which the pointwise mean $\mu$ is L-continuous.4

While this model suffices for learning the document at the top slot (see Kleinberg et al., 2008b 
for details), it is not sufficiently informative for lower slots. This is because the relevant click 
probabilities $\mu(\cdot\,|Z_S)$ are conditional and therefore are not directly constrained by L-continuity. To
enable efficient learning in all k slots, we will assume a stronger property called conditional L- 
continuity: 


Definition 1 $\mathcal{P}$ is called conditionally L-continuous w.r.t. $(X,D)$ if the conditional pointwise mean
$\mu(\cdot\,|Z_S)$ is L-continuous for all $S \subset X$.


Now, a document $x$ in slot $i > 1$ is examined only if event $Z_S$ happens, where $S$ is the set of
documents in the higher slots: that is, if all documents in the higher slots are not relevant to the
user. The document $x$ has a conditional click probability $\mu(x|Z_S)$. The function $\mu(\cdot\,|Z_S)$ satisfies the
Lipschitz condition (1), which will allow us to use the machinery from MAB problems on metric 
spaces. 

Formally, we define the $k$-slot Lipschitz MAB problem, an instance of which consists of a triple
$(X,D,\mathcal{P})$, where $(X,D)$ is a metric space that is known to an algorithm, and $\mathcal{P}$ is a latent user
distribution which is conditionally L-continuous w.r.t. $(X,D)$.





4. One only needs to assume that similarity between any two documents $x,y$ is summarized by a number $\delta_{x,y}$ such that
$|\mu(x) - \mu(y)| \le \delta_{x,y}$. Then one obtains a metric space by taking the shortest paths closure.

Note that the $k$-slot Lipschitz MAB problem subsumes the “metric-free” ranked bandit problem
from Radlinski et al. (2008) (as a special case with a trivial metric space in which all distances are 
equal to 1) and the Lipschitz MAB problem from Kleinberg et al. (2008b) (as a special case with a 
single slot). 


3.1 Metric Space: A Running Example 


Web documents are often classified into hierarchies, where closer pairs are more similar.5 For
evaluation, we assume the documents $X$ fall in such a tree, with each document $x \in X$ a leaf in the
tree. On this tree, we consider a very natural metric: the distance between any two tree nodes $u,v$ is
exponential in the height (i.e., the hop-count distance to the root) of their least common ancestor:

$$D(u,v) = c \times \epsilon^{\mathrm{height}(\mathrm{LCA}(u,v))}$$

for some constant $c$ and base $\epsilon \in (0,1)$. We call this the $\epsilon$-exponential tree metric (with constant $c$).
However, our algorithms and analyses extend to arbitrary metric spaces. 
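The exponential tree metric is straightforward to compute from parent pointers. The sketch below is only illustrative: the tree is assumed to be given as a hypothetical `parent` dictionary (the root maps to `None`), and the distance from a node to itself is set to zero by convention, since the formula is only meant for distinct nodes.

```python
from typing import Dict, Hashable, Optional

Parent = Dict[Hashable, Optional[Hashable]]

def hops_to_root(node: Hashable, parent: Parent) -> int:
    """Hop-count distance from `node` to the root."""
    hops = 0
    while parent[node] is not None:
        node = parent[node]
        hops += 1
    return hops

def least_common_ancestor(u: Hashable, v: Hashable, parent: Parent) -> Hashable:
    """Deepest node that is an ancestor of both u and v (a node is its own ancestor)."""
    ancestors = set()
    node: Optional[Hashable] = u
    while node is not None:
        ancestors.add(node)
        node = parent[node]
    node = v
    while node not in ancestors:
        node = parent[node]
    return node

def exp_tree_distance(u, v, parent: Parent, eps: float, c: float = 1.0) -> float:
    """c * eps ** (hop count from the root to the LCA of u and v); zero if u == v."""
    if u == v:
        return 0.0
    return c * eps ** hops_to_root(least_common_ancestor(u, v, parent), parent)

# Hypothetical two-level taxonomy with three documents as leaves.
parent = {"root": None, "A": "root", "B": "root", "a1": "A", "a2": "A", "b1": "B"}
d_same_topic = exp_tree_distance("a1", "a2", parent, eps=0.5)
d_cross_topic = exp_tree_distance("a1", "b1", parent, eps=0.5)
```

Documents under the same subtopic are closer than documents whose only common ancestor is the root, and the gap grows geometrically with the depth of the shared subtree.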


3.2 Alternative Notion of Document Similarity 


An alternative notion of document similarity focuses on correlated relevance: correlation between 
the relevance of two documents to a given user. We express “similarity” by bounding the probability 
of the “discorrelation event” $\{\pi(x) \ne \pi(y)\}$. Specifically, we consider conditional L-correlation,
defined as follows: 


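Definition 2 Call $\mathcal{P}$ L-correlated w.r.t. $(X,D)$ if

$$\Pr_{\pi \sim \mathcal{P}}[\pi(x) \ne \pi(y)] \le D(x,y) \quad \forall x,y \in X. \tag{2}$$

Call $\mathcal{P}$ conditionally L-correlated w.r.t. $(X,D)$ if (2) holds conditional on $Z_S$ for any $S \subset X$, that is,

$$\Pr_{\pi \sim (\mathcal{P}|Z_S)}[\pi(x) \ne \pi(y)] \le D(x,y) \quad \forall x,y \in X,\; S \subset X.$$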


It is easy to see that conditional L-correlation implies conditional L-continuity. In fact, we show 
that the two notions are essentially equivalent. Namely, we prove that conditional L-continuity 
w.r.t. (X, D) implies conditional L-correlation w.r.t. (X,2D). 


Lemma 3 Consider an instance $(X,D,\mathcal{P})$ of the $k$-slot Lipschitz MAB problem. Then the user
distribution $\mathcal{P}$ is conditionally L-correlated w.r.t. $(X,2D)$.


Proof Fix documents $x,y \in X$ and a subset $S \subset X$. For brevity, write “$x = 1$” to mean “$\pi(x) = 1$”,
etc. We claim that

$$\Pr[x = 1 \wedge y = 0 \,|\, Z_S] \le D(x,y). \tag{3}$$

Indeed, consider the event $Z = Z_{S \cup \{y\}}$. Applying the Bayes theorem to $(\mathcal{P}|Z_S)$, we obtain that

$$\mu(x|Z) = \Pr[x = 1 \,|\, \{y = 0\} \wedge Z_S] = \frac{\Pr[x = 1 \wedge y = 0 \,|\, Z_S]}{\Pr[y = 0 \,|\, Z_S]}. \tag{4}$$





On the other hand, since $\mu(y|Z) = 0$, by conditional L-continuity it holds that

$$\mu(x|Z) = |\mu(x|Z) - \mu(y|Z)| \le D(x,y), \tag{5}$$

so claim (3) follows from Equation (4) and Equation (5), since the denominator in Equation (4) is at most 1.
Likewise, $\Pr[x = 0 \wedge y = 1 \,|\, Z_S] \le D(x,y)$. Since

$$\{\pi(x) \ne \pi(y)\} = \{x = 1 \wedge y = 0\} \cup \{x = 0 \wedge y = 1\},$$

it follows that $\Pr[\pi(x) \ne \pi(y) \,|\, Z_S] \le 2D(x,y)$. □

5. One example of such hierarchical classification is the Open Directory Project (http://dmoz.org).





4. Expressiveness of the Model 


Our approach relies on the conditional L-continuity (equivalently, conditional L-correlation) of the 
user distribution. How “expressive” is this assumption, that is, how rich and “interesting” is the 
collection of problem instances that satisfy it? While the unconditional L-continuity assumption 
is usually considered reasonable from the expressiveness point of view, even the unconditional L- 
correlation (let alone the conditional L-correlation) is a very non-trivial property about correlated 
relevance, and thus potentially problematic. A related concern is how to generate a suitable collec- 
tion of problem instances for simulation experiments. 

We address both concerns by defining a natural (albeit highly stylized) generative model for the 
user distribution, which we then use in the experiments in Section 8. We start with a tree metric 
space $(X,D)$ and the desired pointwise mean $\mu : X \to (0, \tfrac12]$ that is L-continuous w.r.t. $(X,D)$. The
generative model provides a rich family of user distributions that are conditionally L-continuous
w.r.t. $(X,cD)$, for some small $c$. This result is a key theoretical contribution of this paper (and by
far the most technical one). 

We develop the generative model in Section 4.1. We extend this result to arbitrary metric spaces 
in Section 4.2, and to distributions over conditionally L-continuous user distributions in Section 4.3. 
To keep the flow of the paper, the detailed analysis is deferred to Section A and Section B. 


4.1 Bayesian Tree Network 


The generative model is a tree-shaped Bayesian network with 0-1 “relevance values” $\pi(\cdot)$ on nodes,
where leaves correspond to documents. The tree is essentially a topical taxonomy on documents: 
subtopics correspond to subtrees. The relevance value on each sub-topic is obtained from that on 
the parent topic via a low-probability mutation. 

The mutation probabilities need to be chosen so as to guarantee conditional L-continuity and 
the desired pointwise mean $\mu$. It is fairly easy to derive a necessary and sufficient condition for the
pointwise mean, and a necessary condition for conditional L-continuity. The latter condition states 
that the mutation probabilities need to be bounded in terms of the distance between the child and 
the parent. The hard part is to prove that this condition is sufficient. 

Let us describe our Bayesian tree network in detail. The network inputs a tree metric space
$(X,D)$ and the desired pointwise mean $\mu$, and outputs a relevance vector $\pi : X \to \{0,1\}$. Specifically,
we assume that documents are leaves of a finite rooted edge-weighted tree, which we denote $\tau_d$, with
node set $V$ and leaf set $X \subset V$, so that $D$ is a (weighted) shortest-paths metric on $V$.

Algorithm 1 User distribution for tree metrics
Input: Tree (root $r$, node set $V$); $\mu(r) \in [0,1]$;
       mutation probabilities $q_0, q_1 : V \to [0,1]$
Output: relevance vector $\pi : V \to \{0,1\}$

function AssignClicks(tree node $v$)
    $b \leftarrow \pi(v)$
    for each child $u$ of $v$ do
        $\pi(u) \leftarrow 1 - b$ with probability $q_b(u)$, and $\pi(u) \leftarrow b$ otherwise
        AssignClicks($u$)

Pick $\pi(r) \in \{0,1\}$ at random with expectation $\mu(r)$
AssignClicks($r$)





Recall that $\mu$ is L-continuous w.r.t. $(X,D)$. We assume that $\mu$ takes values in the interval $[\alpha, \tfrac12]$,
for some constant parameter $\alpha > 0$. We show that $\mu$ can be extended from $X$ to $V$ preserving the
range and L-continuity (see Section A for the proof).


Lemma 4 $\mu$ can be extended to $V$ so that $\mu : V \to [\alpha, \tfrac12]$ is L-continuous w.r.t. $(V,D)$.


In what follows, by a slight abuse of notation we will assume that the domain of $\mu$ is $V$, with the
same range $[\alpha, \tfrac12]$, and that $\mu$ is L-continuous w.r.t. $(V,D)$. Also, we redefine the relevance vectors
to be functions $V \to \{0,1\}$ rather than $X \to \{0,1\}$.

The Bayesian network itself is very intuitive. We pick $\pi(\text{root}) \in \{0,1\}$ at random with a suitable
expectation $\mu(\text{root})$, and then proceed top-down so that the child’s bit is obtained from the parent’s
bit via a low-probability mutation. The mutation is parameterized by functions $q_0, q_1 : V \to [0,1]$,
as described in Algorithm 1: for each node $u$, if the parent’s bit is set to $b$ then the mutation
$\{\pi(u) = 1 - b\}$ happens with probability $q_b(u)$. These parameters let us vary the degree of independence
between each child and its parent, resulting in a rich family of user distributions.
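The top-down sampling procedure of Algorithm 1 can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: the tree is assumed to be given as a hypothetical `children` dictionary, the mutation probabilities as dictionaries keyed by node, and the recursion is replaced by an explicit stack to avoid depth limits on tall trees.

```python
import random
from typing import Dict, Hashable, List

def sample_relevance(
    root: Hashable,
    children: Dict[Hashable, List[Hashable]],
    mu_root: float,
    q0: Dict[Hashable, float],
    q1: Dict[Hashable, float],
    rng: random.Random,
) -> Dict[Hashable, int]:
    """Sample one relevance vector on all tree nodes.

    q0[u] (resp. q1[u]) is the probability that node u flips its parent's bit
    when that bit is 0 (resp. 1).
    """
    # Root bit: 1 with probability mu_root.
    pi = {root: 1 if rng.random() < mu_root else 0}
    stack = [root]
    while stack:
        v = stack.pop()
        b = pi[v]
        for u in children.get(v, []):
            flip_prob = q1[u] if b == 1 else q0[u]
            pi[u] = 1 - b if rng.random() < flip_prob else b
            stack.append(u)
    return pi

# Hypothetical tree: root "r" with children "a", "b"; "a" has leaves "x", "y".
children = {"r": ["a", "b"], "a": ["x", "y"]}
nodes = ["a", "b", "x", "y"]
never = {n: 0.0 for n in nodes}
always = {n: 1.0 for n in nodes}
copied = sample_relevance("r", children, 1.0, never, never, random.Random(0))
flipped = sample_relevance("r", children, 1.0, always, always, random.Random(0))
```

With zero mutation probabilities every node copies the root, so all documents are perfectly correlated; with mutation probability one, the bit alternates with depth. Intermediate values interpolate between full correlation and independence.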

To complete the construction, it remains to define the mutation probabilities $q_0, q_1$. Let $\mathcal{P}$ be the
resulting user distribution. It is easy to see that $\mu$ is the pointwise mean of $\mathcal{P}$ on $V$ if and only if

$$\mu(u) = (1 - \mu(v))\, q_0(u) + \mu(v)\,(1 - q_1(u)) \tag{6}$$

whenever $u$ is a child of $v$. (For sufficiency, use induction on the tree.) Further, letting $q_b = q_b(u)$
for each bit $b \in \{0,1\}$, note that

$$\begin{aligned}
\Pr[\pi(u) \ne \pi(v)] &= \mu(v)\, q_1 + (1 - \mu(v))\, q_0 \\
&= \mu(v)(q_0 + q_1) + (1 - 2\mu(v))\, q_0 \\
&\ge \mu(v)(q_0 + q_1),
\end{aligned}$$

where the last step uses $\mu(v) \le \tfrac12$. Thus, if $\mathcal{P}$ is L-correlated w.r.t. $(X,D)$ then

$$q_0(u) + q_1(u) \le D(u,v) / \mu(v). \tag{7}$$

We show that (6-7) suffices to guarantee conditional L-continuity.
For a concrete example, one could define

$$(q_0(u), q_1(u)) = \begin{cases} \left(0,\; \dfrac{\mu(v) - \mu(u)}{\mu(v)}\right) & \text{if } \mu(v) \ge \mu(u), \\[1ex] \left(\dfrac{\mu(u) - \mu(v)}{1 - \mu(v)},\; 0\right) & \text{otherwise.} \end{cases} \tag{8}$$

The $q_0, q_1$ defined as above satisfy (6-7) for any $\mu$ that is L-continuous on $(V,D)$.
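The concrete choice of mutation probabilities can be checked numerically. The sketch below assumes the one-sided form of the example (mutate only in the direction that moves the child's mean toward its target), with `mu_u` and `mu_v` denoting the means at the child and the parent; the function names are hypothetical.

```python
from typing import Tuple

def mutation_probs(mu_u: float, mu_v: float) -> Tuple[float, float]:
    """One-sided mutation probabilities (q0, q1) for child mean mu_u, parent mean mu_v."""
    if mu_v >= mu_u:
        # Only 1 -> 0 mutations: lowers the child's mean.
        return 0.0, (mu_v - mu_u) / mu_v
    # Only 0 -> 1 mutations: raises the child's mean.
    return (mu_u - mu_v) / (1.0 - mu_v), 0.0

def child_mean(mu_v: float, q0: float, q1: float) -> float:
    """Mean of the child's bit: the right-hand side of condition (6)."""
    return (1.0 - mu_v) * q0 + mu_v * (1.0 - q1)

# Hypothetical parent/child means, both within (0, 1/2].
q0_down, q1_down = mutation_probs(mu_u=0.3, mu_v=0.4)
q0_up, q1_up = mutation_probs(mu_u=0.5, mu_v=0.4)
```

In both directions the resulting child mean matches its target, and the total mutation probability is at most the gap between the two means divided by the parent's mean, which is what condition (7) requires whenever the gap is bounded by the distance.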

The provable properties of Algorithm 1 are summarized in the theorem below. It is technically 
more convenient to state this theorem in terms of L-correlation rather than L-continuity. 


Theorem 5 Let $D$ be the shortest-paths metric of an edge-weighted rooted tree with a finite leaf set
$X$. Let $\mu : X \to [\alpha, \tfrac12]$, $\alpha > 0$, be L-continuous w.r.t. $(X,D)$. Suppose $q_0, q_1 : V \to [0,1]$ satisfy (6-7).

Let $\mathcal{P}$ be the user distribution constructed by Algorithm 1. Then $\mathcal{P}$ has pointwise mean $\mu$ and is
conditionally L-correlated w.r.t. $(X, 3D_\mu)$ where

$$D_\mu(x,y) = D(x,y)\, \min\left(\frac{1}{\alpha},\; \frac{3}{\mu(x) + \mu(y)}\right).$$

Remark. The theorem can be strengthened by replacing $D_\mu$ with the shortest-paths metric induced
by $D_\mu$.


Below we provide a proof sketch. The detailed proof is presented in Section B. 

Proof Sketch As we noted above, the statement about the pointwise mean trivially follows from Equa- 
tion (6) using induction on the tree. In what follows we focus on conditional L-correlation. 

Fix leaves $x,y \in X$ and a subset $S \subset X$. Let $z$ be the least common ancestor of $x,y$. Recall
that in Algorithm 1 the bit $\pi(\cdot)$ at each node is a random mutation of that of its parent. We focus
on the event $\mathcal{E}$ that no mutation happened on the $z \to x$ and $z \to y$ paths. Note that $\mathcal{E}$ implies
$\pi(x) = \pi(y) = \pi(z)$. Therefore

$$\Pr[\pi(x) \ne \pi(y) \,|\, Z_S] \le \Pr[\bar{\mathcal{E}} \,|\, Z_S],$$

where $\bar{\mathcal{E}}$ is the negation of $\mathcal{E}$. Intuitively, $\bar{\mathcal{E}}$ is a low-probability “failure event”. The rest of the
proof is concerned with showing that $\Pr[\bar{\mathcal{E}} \,|\, Z_S] \le 3 D_\mu(x,y)$.
First we handle the unconditional case. We claim that

$$\Pr[\bar{\mathcal{E}}] \le D_\mu(x,y). \tag{9}$$

Note that Equation (9) immediately implies that $\mathcal{P}$ is L-correlated w.r.t. $(X,D_\mu)$. This claim is not
very difficult to prove, essentially since the condition in Equation (7) is specifically engineered to 
satisfy the unconditional L-correlation property. We provide the proof in detail. 

Let $w \in \operatorname{argmin}_{u \in P_{xy}} \mu(u)$, where $P_{xy}$ is the $x \to y$ path. Let $(z = x_0, x_1, \ldots, x_n = x)$ be the
$z \to x$ path. For each $i \ge 1$, by Equation (7) the probability of having a mutation at $x_i$ is at most
$D(x_i, x_{i-1})/\mu(w)$, so by the union bound the probability of having a mutation on the $z \to x$ path is at most $D(x,z)/\mu(w)$.
Likewise for the $z \to y$ path. So $\Pr[\bar{\mathcal{E}}] \le D(x,y)/\mu(w) \le D(x,y)/\alpha$.

It remains to prove that

$$\Pr[\bar{\mathcal{E}}] \le \frac{3\, D(x,y)}{\mu(x) + \mu(y)}. \tag{10}$$

Indeed, by L-continuity it holds that

$$\mu(w) \ge \mu(x) - D(x,w), \qquad \mu(w) \ge \mu(y) - D(y,w).$$

Since $D(x,y) = D(x,w) + D(y,w)$, it follows that

$$\mu(w) \ge \frac{\mu(x) + \mu(y) - D(x,y)}{2}. \tag{11}$$

Now, either the right-hand side of Equation (11) is at least $\frac{\mu(x)+\mu(y)}{3}$ or the right-hand side of Equation (10)
is at least 1. In both cases Equation (10) holds. This completes the proof of the claim (9).
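The case split in the last step can be spelled out as follows. This is a sketch of the arithmetic only, writing $\mu$ for the pointwise mean, $\bar{\mathcal{E}}$ for the failure event, and using the bound $\Pr[\bar{\mathcal{E}}] \le D(x,y)/\mu(w)$ established above.

```latex
\begin{align*}
\text{Case 1: } \frac{\mu(x)+\mu(y)-D(x,y)}{2} \ge \frac{\mu(x)+\mu(y)}{3}
  &\;\Longrightarrow\; \mu(w) \ge \frac{\mu(x)+\mu(y)}{3}
  \;\Longrightarrow\; \Pr[\bar{\mathcal{E}}] \le \frac{D(x,y)}{\mu(w)}
      \le \frac{3\,D(x,y)}{\mu(x)+\mu(y)}. \\[1ex]
\text{Case 2: } \frac{\mu(x)+\mu(y)-D(x,y)}{2} < \frac{\mu(x)+\mu(y)}{3}
  &\;\Longrightarrow\; D(x,y) > \frac{\mu(x)+\mu(y)}{3}
  \;\Longrightarrow\; \frac{3\,D(x,y)}{\mu(x)+\mu(y)} > 1 \ge \Pr[\bar{\mathcal{E}}].
\end{align*}
```

The threshold $\frac{\mu(x)+\mu(y)}{3}$ is the unique value for which both implications go through, which is where the constant 3 in the bound comes from.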


The conditional case is much more difficult. We handle it by showing that

$$\Pr[\bar{\mathcal{E}} \,|\, Z_S] \le 3 \Pr[\bar{\mathcal{E}}]. \tag{12}$$

In fact, Equation (12) holds even if Equation (7) is replaced with a much weaker bound:
$\max(q_0(u), q_1(u)) \le \tfrac12$ for each $u$.

The mathematically subtle proof of Equation (12) can be found in Section B. The crux in this
proof is that event $Z_S$ is more likely if node $z$ is not relevant to the user:

$$\Pr[Z_S \,|\, \pi(z) = 0] \ge \Pr[Z_S \,|\, \pi(z) = 1].$$


4.2 Arbitrary Metric Spaces 


We can extend Theorem 3.1 to arbitrary metric spaces using prior work on metric embeddings. Fix
an N-point metric space $(X,D)$ and a function $\mu : X \to [\alpha, \frac{1}{2}]$ that is L-continuous on $(X,D)$. It
is known (Bartal, 1996; Fakcharoenphol et al., 2004) that there exists a distribution $\mathcal{P}_{\text{tree}}$ over tree
metric spaces $(X,T)$ such that $D(x,y) \le T(x,y)$ and

    $$\mathbb{E}_{T \sim \mathcal{P}_{\text{tree}}}[T(x,y)] \le c\,D(x,y) \quad \forall x,y \in X,$$

where $c = O(\log N)$.⁶

Our construction (Algorithm 2) is simple: first sample a tree metric space $(X,T)$ from $\mathcal{P}_{\text{tree}}$,
then independently generate a user distribution $\mathcal{P}_T$ for $(X,T)$ as per Algorithm 1.


Theorem 6 The user distribution $\mathcal{P}$ produced by Algorithm 2 has pointwise mean $\mu$ and is conditionally
L-correlated w.r.t. $(X, 3c\,D_\mu)$, where $D_\mu$ is given by

    $$D_\mu(x,y) = D(x,y)\,\min\left(\frac{1}{\alpha},\; \frac{3}{\mu(x)+\mu(y)}\right).$$





6. This is the main result in Fakcharoenphol et al. (2004), which improves on an earlier result in Bartal (1996) with
$c = O(\log^2 N)$. For point sets in a d-dimensional Euclidean space one could take $c = O(d \log \frac{1}{\varepsilon})$, where $\varepsilon$ is the
minimal distance. In fact, this result extends to a much more general family of metric spaces: those of doubling
dimension d (Gupta et al., 2003). Doubling dimension, the smallest d such that any ball can be covered by $2^d$ balls
of half the radius, was introduced to the theoretical computer science literature in Gupta et al. (2003), and has
been a well-studied concept since then.
Algorithm 2 User distribution for arbitrary metric spaces

Input: metric space $(X,D)$; function $\mu : X \to [\alpha, \frac{1}{2}]$ that is L-continuous on $(X,D)$.
Output: relevance vector $\pi : X \to \{0,1\}$

1. Sample a tree metric space $(X,T)$ from $\mathcal{P}_{\text{tree}}$.
2. Run Algorithm 1 for $(X,T)$, output the resulting $\pi$.





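Proof The function $\mu$ is L-continuous w.r.t. each tree metric space $(X,T)$, so by Theorem 3.1 user
distribution $\mathcal{P}_T$ has pointwise mean $\mu$ and is conditionally L-correlated w.r.t. $(X, 3T_\mu)$. It follows
that the aggregate user distribution $\mathcal{P}$ has pointwise mean $\mu$, and moreover for any $x,y \in X$ and
$S \subset X$ we have

    $$\begin{aligned}
    \Pr_{\pi \sim \mathcal{P}}[\pi(x) \neq \pi(y) \mid Z_S]
      &\le \mathbb{E}_{T \sim \mathcal{P}_{\text{tree}}} \Big[ \Pr_{\pi \sim \mathcal{P}_T}[\pi(x) \neq \pi(y) \mid Z_S] \Big] \\
      &\le \mathbb{E}_{T \sim \mathcal{P}_{\text{tree}}} \big[ 3\,T_\mu(x,y) \big] \\
      &\le 3c\,D_\mu(x,y).
    \end{aligned}$$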

















4.3 Distributions over User Distributions 


Let us verify that conditional L-continuity is robust, in the sense that any distribution over condition- 
ally L-continuous user distributions is itself conditionally L-continuous. This result considerably 
extends the family of user distributions for which we have conditional L-continuity guarantees. 


Lemma 7 Let $\mathcal{P}$ be a distribution over countably many user distributions $\mathcal{P}_i$ that are conditionally
L-continuous w.r.t. a metric space $(X,D)$. Then $\mathcal{P}$ is conditionally L-continuous w.r.t. $(X,D)$.


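Proof Let $\mu$ and $\mu_i$ be the (conditional) pointwise means of $\mathcal{P}$ and $\mathcal{P}_i$, respectively. Formally, let
us treat each $\mathcal{P}_i$ as a measure, so that $\mathcal{P}_i(E)$ is the probability of event E under $\mathcal{P}_i$. Let $\mathcal{P} = \sum_i q_i \mathcal{P}_i$,
where $\{q_i\}$ are positive coefficients that sum up to 1. Fix documents $x,y \in X$ and a subset $S \subset X$.
Then

    $$\begin{aligned}
    \mu(x|S) = \mathcal{P}(x = 1 \mid Z_S)
      &= \frac{\mathcal{P}(x = 1 \wedge Z_S)}{\mathcal{P}(Z_S)} \\
      &= \frac{\sum_i q_i\, \mathcal{P}_i(x = 1 \wedge Z_S)}{\mathcal{P}(Z_S)} \\
      &= \frac{\sum_i q_i\, \mathcal{P}_i(Z_S)\, \mu_i(x|Z_S)}{\mathcal{P}(Z_S)}.
    \end{aligned}$$

It follows that

    $$\begin{aligned}
    |\mu(x|S) - \mu(y|S)|
      &= \frac{\left| \sum_i q_i\, \mathcal{P}_i(Z_S)\, \big(\mu_i(x|Z_S) - \mu_i(y|Z_S)\big) \right|}{\mathcal{P}(Z_S)} \\
      &\le \frac{\sum_i q_i\, \mathcal{P}_i(Z_S)\, D(x,y)}{\mathcal{P}(Z_S)} \\
      &\le D(x,y).
    \end{aligned}$$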








5. Algorithms from Prior Work 


Let us discuss some algorithmic ideas from prior work that can be adapted to our setting. Interest- 
ingly, one can combine these algorithms in a modular way, which we make particularly transparent 
by putting forward a suitable naming scheme. Throughout this section, we let Bandit be some 
algorithm for the MAB problem. 


5.1 Ranked Bandits 


Given some bandit algorithm Bandit, the “ranked” algorithm RankBandit for the multi-slot MAB
problem is defined as follows (Radlinski et al., 2008). We have k slots (i.e., ranks) for which we wish
to find the best documents to present. In each slot i, a separate instance $\mathcal{A}_i$ of Bandit is created.
In each round these instances select the documents to show independently of one another. If a user
clicks on slot i, then this slot receives a reward of 1, and all higher (i.e., skipped) slots j < i receive
a reward of 0. For slots j > i, the state is rolled back as if this round had never happened (as if the
user never considered these documents). If no slot is clicked, then all slots receive a reward of 0.
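
The feedback rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `EpsGreedy` is a hypothetical stand-in for Bandit (it is not UCB1 or EXP3), `click_fn` is a hypothetical user simulator that returns the index of the clicked slot or `None`, and the handling of duplicate documents across slots is omitted.

```python
import random


class EpsGreedy:
    """Hypothetical stand-in for Bandit: epsilon-greedy over a finite arm set."""

    def __init__(self, n_arms, eps=0.1, seed=0):
        self.n = [0] * n_arms      # number of recorded plays per arm
        self.r = [0.0] * n_arms    # total recorded reward per arm
        self.eps = eps
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.eps or not any(self.n):
            return self.rng.randrange(len(self.n))
        return max(range(len(self.n)),
                   key=lambda a: self.r[a] / self.n[a] if self.n[a] else 0.0)

    def update(self, arm, reward):
        self.n[arm] += 1
        self.r[arm] += reward


def rank_bandit_round(slots, click_fn):
    """One round of RankBandit with one bandit instance per slot."""
    ranking = [s.select() for s in slots]   # independent selections
    clicked = click_fn(ranking)             # clicked slot index, or None
    for i, (s, arm) in enumerate(zip(slots, ranking)):
        if clicked is None or i < clicked:
            s.update(arm, 0)                # skipped slot: reward 0
        elif i == clicked:
            s.update(arm, 1)                # clicked slot: reward 1
        # i > clicked: no update, as if the round never happened
    return ranking, clicked
```

For example, `rank_bandit_round([EpsGreedy(100, seed=i) for i in range(5)], click_fn)` plays one round of a 5-slot instance over 100 documents.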

Let us emphasize that the above approach can be applied to any algorithm Bandit. In Radlinski
et al. (2008), this approach gives rise to algorithms RankUCB1 and RankEXP3, based on MAB
algorithms UCB1 and EXP3 (Auer et al., 2002a,b). EXP3 is designed for the adversarial setting with no
assumptions on how the clicks are generated, which translates into concrete provable guarantees for
RankEXP3. UCB1 is geared towards the stochastic setting with i.i.d. rewards on each arm, although
the per-slot i.i.d. assumption breaks for slots i > 1 because of the influence of the higher slots.
Nevertheless, in small-scale experiments RankUCB1 performs much better than RankEXP3 (Radlinski
et al., 2008).

Provable guarantees. Letting T be the number of rounds and OPT be the probability of clicking
on the optimal ranking, algorithm RankBandit achieves

    $$\mathbb{E}[\#\text{clicks}] \ge \left(1 - \tfrac{1}{e}\right) T \times \text{OPT} - k\,R(T),$$    (13)

where R(T) is any upper bound on regret for Bandit in each slot (Radlinski et al., 2008; Streeter
and Golovin, 2008).

In the multi-slot setting, performance of an algorithm up to time T is defined as the time-averaged
expected total number of clicks. We will consider performance as a function of T. Assuming
$R(T) = o(T)$ in Equation (13), performance of RankBandit converges to or exceeds $(1 - \frac{1}{e})\,\text{OPT}$.
Convergence to $(1 - \frac{1}{e})\,\text{OPT}$ is proved to be worst-case optimal. Thus, as long as R(T) scales well
with time, for the document collection sizes that are typical for the application at hand, Radlinski
et al. (2008) interpret Equation (13) as a proof of an algorithm’s scalability in the multi-slot MAB
setting.

RankBandit is presented in Radlinski et al. (2008) as the online version of the greedy algorithm:
an offline fully informed algorithm that selects documents greedily slot by slot from top to bottom.
The performance of this algorithm is called the greedy optimum,⁷ which is equal to $(1 - \frac{1}{e})\,\text{OPT}$ in
the worst case, but for “benign” problem instances it can be as good as OPT. The greedy optimum is
a more natural benchmark for RankBandit than $(1 - \frac{1}{e})\,\text{OPT}$. However, results w.r.t. this benchmark
are absent in the literature.⁸


5.2 Lipschitz Bandits 


Both UCB1 and EXP3 are impractical when there are too many documents to explore them all. To 
alleviate this issue, one can use the similarity information provided by the metric space and the 
Lipschitz assumption; this setting is called Lipschitz MAB. 

Below we describe two “metric-aware” algorithms from Kleinberg (2004) and Kleinberg et al.
(2008b). Both are well-defined for arbitrary metric spaces, but for simplicity we present them for a
special case in which documents are leaves in a document tree (denoted $\tau_d$) with an ε-exponential
tree metric. In both algorithms, a subtree is chosen in each round, then a document in this subtree
is sampled at random, choosing uniformly at each branch.

Given some bandit algorithm Bandit, Kleinberg (2004) defines algorithm GridBandit for the
Lipschitz MAB setting. This algorithm proceeds in phases: in phase i, the depth-i subtrees are
treated as “arms”, and a fresh copy of Bandit is run on these arms.⁹ Phase i lasts for $k\,\varepsilon^{-2i}$ rounds,
where k is the number of depth-i subtrees. This meta-algorithm, coupled with an adversarial MAB
algorithm such as EXP3, is the only algorithm in the literature that takes advantage of the metric
space in the adversarial setting. Following Radlinski et al. (2008), we expect GridEXP3 to be
overly pessimistic for our problem, trumped by the corresponding stochastic MAB approaches such
as GridUCB1.

The “zooming algorithm” (Kleinberg et al., 2008b, Algorithm 3) is a more efficient version
of GridUCB1: instead of iteratively reducing the grid size in the entire metric space, it adaptively
refines the grid in promising areas. It maintains a set A of active subtrees which collectively partition
the leaf set. In each round the active subtree with the maximal index is chosen. The index of a
subtree is (assuming stochastic rewards) the best available upper confidence bound on the click
probabilities in this subtree. It is defined via the confidence radius¹⁰ given (letting T be the time
horizon) by

    $$\text{rad}(\cdot) \triangleq \sqrt{4 \log(T) / (1 + \#\text{samples}(\cdot))}.$$    (14)

The algorithm “zooms in” on a given active subtree u (de-activates u and activates all its children)
when rad(u) becomes smaller than its width $W(u) \triangleq \varepsilon^{\text{depth}(u)} = \max_{x,x' \in u} D(x,x')$.





7. If due to ties there are multiple “greedy rankings”, define the greedy optimum via the worst of them. 

8. Following the conference publication of this paper, Streeter and Golovin claimed that the techniques in Streeter and 
Golovin (2008) can be used to extend Equation (13) to the greedy optimum benchmark. If so, then it may be possible 
to use the same approach to improve our guarantees. 

9. As an empirical optimization, previous events can also be replayed to better initialize later phases. 

10. The meaning of rad(-) is that w.h.p. the sample average is within +rad(-) from the true mean. 




Algorithm 3 “Zooming algorithm” in trees
initialize (document tree τ_d):
    A ← ∅; activate( root(τ_d) )

activate( u ∈ nodes(τ_d) ):
    A ← A ∪ {u}; n(u) ← 0; r(u) ← 0

Main loop:
    u ← argmax_{u ∈ A} index(u),
        where index(u) = r(u)/n(u) + 2 rad(u)
    “Play” a random document from subtree(u)
    r(u) ← r(u) + {reward}; n(u) ← n(u) + 1
    if rad(u) < W(u) then
        deactivate u: remove u from A
        activate all children of u
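
A minimal Python simulation of Algorithm 3 may help make the bookkeeping concrete. It rests on several assumptions that are not in the text: the document tree is a complete binary tree whose leaves are numbered 0..2^depth − 1, a node is the hypothetical pair `(level, idx)`, clicks are simulated from a given vector `mu`, the sample average of an unplayed subtree is taken to be 0, and leaves are never zoomed into.

```python
import math
import random


def zoom(mu, depth, eps, T, rng):
    """Simulate the zooming algorithm for T rounds; return (clicks, active set)."""

    def rad(n):
        # confidence radius from Equation (14)
        return math.sqrt(4 * math.log(T) / (1 + n))

    active = {(0, 0): [0, 0.0]}  # node -> [n(u), r(u)]; starts with the root

    def index(u):
        n, r = active[u]
        return (r / n if n else 0.0) + 2 * rad(n)

    clicks = 0
    for _round in range(T):
        u = max(active, key=index)
        level, idx = u
        # "play" a random document: choose uniformly at each branch
        x = idx
        for _step in range(depth - level):
            x = 2 * x + rng.randrange(2)
        reward = 1 if rng.random() < mu[x] else 0
        clicks += reward
        active[u][0] += 1
        active[u][1] += reward
        # zoom in when rad(u) drops below the width W(u) = eps ** depth(u)
        if level < depth and rad(active[u][0]) < eps ** level:
            del active[u]
            active[(level + 1, 2 * idx)] = [0, 0.0]
            active[(level + 1, 2 * idx + 1)] = [0, 0.0]
    return clicks, active
```

The active set returned by `zoom` always partitions the leaf set, which is the invariant the description above relies on.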





Provable guarantees. Regret guarantees for the two algorithms above are independent of the
number of arms (which, in particular, can be infinite). Instead, they depend on the covering
properties of the metric space $(X,D)$. A crucial notion here is the covering number $N_r(X)$, defined as
the minimal number of balls of radius r sufficient to cover X. It is often useful to summarize the
covering numbers $N_r(X)$, $r > 0$, with a single number called the covering dimension:

    $$\text{CovDim}(X,D) \triangleq \inf\{ d \ge 0 : N_r(X) \le \alpha\, r^{-d} \;\; \forall r > 0 \}.$$    (15)

(Here $\alpha > 0$ is a constant which we will keep implicit in the notation.) In particular, for an arbitrary
point set in $\mathbb{R}^d$ under the standard ($\ell_2$) distance, the covering dimension is d, for some $\alpha = O(1)$.
For an ε-exponential tree metric with maximal branching factor b, the covering dimension is
$d = \log_{1/\varepsilon}(b)$, with $\alpha = 1$.
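
As a quick numerical illustration of the last formula, the following sketch evaluates $d = \log_{1/\varepsilon}(b)$. The function name is hypothetical, and the example values are chosen only for illustration.

```python
import math


def tree_covering_dimension(eps, b):
    """Covering dimension log_{1/eps}(b) of an eps-exponential tree metric
    with maximal branching factor b (constant alpha = 1)."""
    return math.log(b) / math.log(1.0 / eps)


# A binary tree with eps = 0.5 has covering dimension 1;
# a binary tree with eps = 0.837 has covering dimension of roughly 3.9.
d_half = tree_covering_dimension(0.5, 2)
d_flat = tree_covering_dimension(0.837, 2)
```

A value of ε closer to 1 makes distances shrink more slowly with depth, which is why the second dimension is larger.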

Against an oblivious adversary, GridEXP3 has regret

    $$R(T) = O\left(\alpha\, T^{(d+1)/(d+2)}\right),$$    (16)

where d is the covering dimension of $(X,D)$.

For the stochastic setting, GridUCB1 and the zooming algorithm enjoy strong instance-dependent 
regret guarantees. These guarantees reduce to Equation (16) in the worst case, but are much better 
for “nice” problem instances. Informally, regret guarantees improve for problem instances in which 
the set of near-optimal arms has smaller covering numbers than the set of all arms. Regret guar- 
antees for the zooming algorithm are (typically) much stronger than for GridUCB1. In particular, 
one can derive a version of Equation (16) with a different d called the zooming dimension, which 
is equal to the covering dimension in the worst case but can be much smaller, even d = 0. These 
issues are further discussed in Appendix C. 


5.3 Anytime Guarantees and the Doubling Trick 


While the zooming algorithm, and also the contextual zooming algorithm from Section 5.5, are
defined for a fixed time horizon, one can obtain the corresponding anytime versions using a simple
doubling trick: in each phase $i \in \mathbb{N}$, run a fresh instance of the algorithm for $2^i$ rounds. These
versions are run indefinitely and enjoy the same asymptotic upper bounds on regret as the original
algorithms (but now these bounds hold for each round).


5.4 Ranked Bandits in Metric Spaces 


Using and combining the algorithms in the previous two subsections, we obtain the following battery
of algorithms for the k-slot Lipschitz MAB problem:

• metric-oblivious algorithms: RankUCB1 and RankEXP3.

• simple metric-aware algorithms: RankGridUCB1 and RankGridEXP3
  (ranked versions of GridUCB1 and GridEXP3, respectively).

• RankZoom: the ranked version of the zooming algorithm.

In theory, RankGridEXP3 scales to large document collections, in the sense that it achieves
Equation (13) with R(T) that does not degenerate with the number of documents:


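Theorem 8 Consider the k-slot Lipschitz MAB problem on a metric space with covering dimension
d (as defined in Equation (15), with constant α). Then after T rounds RankGridEXP3 achieves

    $$\frac{\mathbb{E}[\#\text{clicks}]}{T} \ge \left(1 - \tfrac{1}{e}\right) \times \text{OPT} - O\left(\frac{\alpha k}{T^{1/(d+2)}}\right).$$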


The theorem follows from the respective regret bounds for GridEXP3 (Equation (16)) and Rank- 
Bandit (Equation (13)). We do not have any provable guarantees for other algorithms because 
the corresponding regret bounds for the single-slot setting do not directly plug into Equation (13). 
However, the strong instance-dependent guarantees for GridUCB1 and especially for the zooming 
algorithm (even though they do not directly apply to the ranked bandit setting) suggest that Rank- 
GridUCB1 and RankZoom are promising. We shall see that these two algorithms perform much 
better than RankGridEXP3 in the experiments. 








5.5 Contextual Lipschitz Bandits 


We also leverage prior work on contextual bandits. The relevant contextual MAB setting, called
contextual Lipschitz MAB, is as follows. In each round nature reveals a context h, an algorithm
chooses a document x, and the resulting reward is an independent {0,1} sample with expectation
$\mu(x|h)$. Further, one is given similarity information: metrics $D$ and $D_c$ on documents and contexts,
respectively, such that for any two documents $x,x'$ and any two contexts $h,h'$ we have

    $$|\mu(x|h) - \mu(x'|h')| \le D(x,x') + D_c(h,h').$$

Let $X_c$ be the set of contexts, and $X_{dc} = X \times X_c$ be the set of all (document, context) pairs.
Abstractly, one considers the metric space $(X_{dc}, D_{dc})$, henceforth the DC-space, where the metric is

    $$D_{dc}((x,h), (x',h')) = D(x,x') + D_c(h,h').$$


We will use the “contextual zooming algorithm” (ContextZoom) from Slivkins (2009). This
algorithm is well-defined for arbitrary $D_{dc}$, but for simplicity we will state it for the case when $D$
and $D_c$ are ε-exponential tree metrics.

Let us assume that documents and contexts are leaves in a document tree $\tau_d$ and context tree $\tau_c$,
respectively. The algorithm (see Algorithm 4 for pseudocode) maintains a set A of active strategies
of the form $(u, u_c)$, where $u$ is a subtree in $\tau_d$ and $u_c$ is a subtree in $\tau_c$. At any given time the active
strategies partition $X_{dc}$. In each round, a context h arrives, and one of the active strategies $(u, u_c)$
with $h \in u_c$ is chosen: namely the one with the maximal index, and then a document $x \in u$ is picked
uniformly at random. The index of $(u, u_c)$ is, essentially, the best available upper confidence bound
on expected rewards from choosing a document $x \in u$ given a context $h \in u_c$. The index is defined
via sample average, confidence radius (14), and “width” $W(u \times u_c)$. The latter can be any upper
bound on the diameter of the product set $u \times u_c$ in the DC-space:

    $$W(u \times u_c) \ge \max_{x,x' \in u,\; h,h' \in u_c} D(x,x') + D_c(h,h').$$    (17)

The (de)activation rule ensures that the active strategies form a finer partition in the regions of the
DC-space that correspond to higher rewards and more frequently occurring contexts.

Algorithm 4 ContextZoom in trees
initialize (document tree τ_d, context tree τ_c):
    A ← ∅; activate( root(τ_d), root(τ_c) )

activate( u ∈ nodes(τ_d), u_c ∈ nodes(τ_c) ):
    A ← A ∪ {(u, u_c)}; n(u, u_c) ← 0; r(u, u_c) ← 0

Main loop:
    Input a context h ∈ nodes(τ_c)
    (u, u_c) ← argmax_{(u, u_c) ∈ A: h ∈ u_c} index(u, u_c),
        where index(u, u_c) = W(u × u_c) + r(u, u_c)/n(u, u_c) + rad(u, u_c)
    “Play” a random document from subtree(u)
    r(u, u_c) ← r(u, u_c) + {reward}; n(u, u_c) ← n(u, u_c) + 1
    if rad(u, u_c) < W(u × u_c) then
        deactivate (u, u_c): remove (u, u_c) from A
        activate all pairs (child(u), child(u_c))

Provable guarantees. The provable guarantees for the contextual MAB problem are in terms
of contextual regret, which is regret with respect to a much stronger benchmark: the best arm in
hindsight for every given context.

Regret guarantees for ContextZoom focus on the DC-space $(X_{dc}, D_{dc})$. A very pessimistic
regret bound is Equation (16) with $d = \text{CovDim}(X_{dc}, D_{dc})$. However, as for the zooming algorithm,
much better instance-dependent bounds are possible. See Appendix C for further discussion.


6. New Approach: Ranked Contextual Bandits 


We now present a new approach in which the upper slot selections are taken into account as a context 
in the contextual MAB setting. 

The slot algorithms in the RankBandit setting can make their selections sequentially. Then
without loss of generality each slot algorithm $\mathcal{A}_i$ knows the set S of documents in the upper slots.
We propose to treat S as a “context” to $\mathcal{A}_i$. Specifically, $\mathcal{A}_i$ will assume that none of the documents in
S is clicked, that is, event $Z_S$ happens (else the i-th slot is ignored by the user). For each such round,
the click probabilities for $\mathcal{A}_i$ are given by $\mu(\cdot \mid Z_S)$, which is an L-continuous function on $(X,D)$.


6.1 RankCorrZoom: “Light-weight” Ranked Contextual Algorithm


We first propose a simple modification to RankZoom, called RankCorrZoom, which uses the contexts 
as discussed above. 

Recall that in the zooming algorithm, the index of an active subtree u is defined so that, assuming
stochastic rewards, it is an upper confidence bound on the click probability of any document x in
this subtree:

    $$\text{w.h.p.} \quad \text{index}(u) \ge \max_{x \in u} \mu(x).$$    (18)

Moreover, it follows from the analysis in (Kleinberg et al., 2008b) that performance of the algorithm
improves if the index is decreased as long as Equation (18) holds.

Now consider RankZoom, and let $\mathcal{A}_i$ be the instance of the zooming algorithm in slot $i \ge 2$.
While for $\mathcal{A}_i$ the rewards are no longer stochastic, our intuition for why RankZoom may be a good
algorithm is still based on Equation (18). In other words, we wish that for each context $S \subset X$ we
have

    $$\text{w.h.p.} \quad \text{index}(u) \ge \max_{x \in u} \mu(x|Z_S),$$    (19)

and our intuition is that it is desirable to decrease the index as long as Equation (19) holds.

We will derive an upper bound on $\max_{x \in u} \mu(x|Z_S)$ using correlation between u and S, and we
will cap the index of u at this quantity. Since $\mu(y|Z_S) = 0$ for any $y \in S$, we have

    $$\mu(x|Z_S) = |\mu(x|Z_S) - \mu(y|Z_S)| \le D(x,y) \quad \forall y \in S,$$
    $$\mu(x|Z_S) \le D(x,S) \triangleq \min_{y \in S} D(x,y).$$    (20)

In other words, if document x is close to some document in S, the event $Z_S$ limits the conditional
probability $\mu(x|Z_S)$. Therefore we can cap the index of u at $\max_{x \in u} D(x,S)$:

    $$\text{index}(u) \leftarrow \min\left( \text{index}(u),\; \max_{x \in u} D(x,S) \right).$$

The version of RankZoom with the above “correlation rule” will be called RankCorrZoom.

To simplify the computation of $\max_{x \in u} D(x,S)$ in an ε-exponential tree metric, we note that it
is equal to $D(\text{root}(u), S)$ if u is disjoint with S, and in general it is equal to $D(\text{root}(v), S)$, where
v is the largest subtree of u that is disjoint with S.
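
The correlation rule can be illustrated with a brute-force computation of the cap on a small tree. This sketch makes assumptions not stated here: the document tree is a complete binary tree with leaves numbered 0..2^depth − 1, a subtree is the hypothetical pair `(level, idx)`, and the distance between two distinct leaves is ε raised to the depth of their least common ancestor (constant c = 1). It enumerates leaves rather than using the root-based shortcut described above.

```python
def lca_depth(x, y, depth):
    """Depth of the least common ancestor of leaves x and y (root has depth 0)."""
    d = depth
    while x != y:
        x >>= 1
        y >>= 1
        d -= 1
    return d


def dist(x, y, depth, eps):
    """Assumed eps-exponential tree metric between leaves."""
    return 0.0 if x == y else eps ** lca_depth(x, y, depth)


def dist_to_set(x, S, depth, eps):
    """D(x, S) = min over y in S of D(x, y), as in Equation (20)."""
    return min(dist(x, y, depth, eps) for y in S)


def cap(u, S, depth, eps):
    """max over leaves x in subtree u of D(x, S), by enumeration."""
    level, idx = u
    lo = idx << (depth - level)
    hi = (idx + 1) << (depth - level)
    return max(dist_to_set(x, S, depth, eps) for x in range(lo, hi))


def capped_index(index_u, u, S, depth, eps):
    """Correlation rule: cap the index at max_{x in u} D(x, S); no cap if S is empty."""
    return min(index_u, cap(u, S, depth, eps)) if S else index_u
```

With depth 3, ε = 0.5 and S = {0}, the subtree over leaves 0..3 gets cap 0.5, which matches the shortcut: the largest subtree disjoint with S covers leaves 2 and 3, and its least common ancestor with leaf 0 has depth 1.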


6.2 Contextual Lipschitz MAB Interpretation 


Let us cast each slot algorithm $\mathcal{A}_i$ as a contextual algorithm in the contextual Lipschitz MAB setting
(as defined in Section 5.5). We need to specify a metric $D_c$ on contexts $S \subset X$ which can be computed
by the algorithm and satisfies the Lipschitz condition:

    $$|\mu(x|Z_S) - \mu(x|Z_{S'})| \le D_c(S,S') \quad \text{for all } x \in X \text{ and } S,S' \subset X.$$    (21)

Lemma 9 Consider the k-slot Lipschitz MAB problem. For any $S,S' \subset X$, define

    $$D_c(S,S') \triangleq 4 \inf \sum_{i=1}^{n} D(x_i, x'_i),$$    (22)

where the infimum is taken over all $n \in \mathbb{N}$ and over all n-element sequences $\{x_i\}$ and $\{x'_i\}$ that
enumerate, possibly with repetitions, all documents in S and S'. Then $D_c$ satisfies Equation (21).
Proof For shorthand, let us write

    $$\sigma(x|S) = 1 - \mu(x|Z_S),$$
    $$\sigma(x|S,y) = \sigma(x \mid S \cup \{y\}).$$

First, we claim that for any $y \in X$ and $y' \in S$

    $$|\sigma(x|S,y) - \sigma(x|S,y')| \le 2\,D(y,y').$$    (23)

Indeed, noting that $\sigma(x|S,y) = \sigma(y|S,x)\, \frac{\sigma(x|S)}{\sigma(y|S)}$, we can re-write the left-hand side of Equation (23)
as follows:

    $$\begin{aligned}
    \text{LHS (23)} &= \sigma(x|S) \left| \frac{\sigma(y|S,x)}{\sigma(y|S)} - \frac{\sigma(y'|S,x)}{\sigma(y'|S)} \right| \\
      &\le \sigma(x|S)\, D(y,y')\, \frac{\sigma(y|S) + \sigma(y|S,x)}{\sigma(y|S)\,\sigma(y'|S)} \qquad (24) \\
      &= D(y,y')\, \frac{\sigma(x|S) + \sigma(x|S,y)}{\sigma(y'|S)} \le 2\,D(y,y').
    \end{aligned}$$

In Equation (24), we have used the L-continuity of $\sigma(\cdot|S)$ and $\sigma(\cdot|S,x)$. To achieve the constant of
2, it was crucial that $y' \in S$, so that $\sigma(y'|S) = 1$. This completes the proof of Equation (23).

Fix some $n \in \mathbb{N}$ and some n-element sequences $\{x_i\}$ and $\{x'_i\}$ that enumerate, possibly with
repetitions, all values in S and S', respectively. Consider sets

    $$S_i = \{x'_1, \ldots, x'_i\} \cup \{x_{i+1}, \ldots, x_n\}, \quad 1 \le i \le n-1,$$

and let $S_0 = S$ and $S_n = S'$. To prove the lemma, it suffices to show that

    $$|\sigma(x|S_i) - \sigma(x|S_{i+1})| \le 4\,D(x_{i+1}, x'_{i+1})$$    (25)

for each $i < n$. To prove Equation (25), fix i and let $y = x_{i+1}$ and $y' = x'_{i+1}$. Note that
$S_i \cup \{y'\} = S_{i+1} \cup \{y\}$; call this set $S^*$. Then using Equation (23) (note, $y \in S_i$ and $y' \in S_{i+1}$) we obtain

    $$|\sigma(x|S_i) - \sigma(x|S^*)| = |\sigma(x|S_i, y) - \sigma(x|S_i, y')| \le 2\,D(y,y'),$$
    $$|\sigma(x|S_{i+1}) - \sigma(x|S^*)| = |\sigma(x|S_{i+1}, y') - \sigma(x|S_{i+1}, y)| \le 2\,D(y,y'),$$

which implies Equation (25). ∎


6.3 RankContextZoom: “Full-blown” Ranked Contextual Algorithm


Now we can take any algorithm for the contextual Lipschitz MAB problem (with metric $D_c$ on
contexts given by Equation (22)), and use it as a slot algorithm. We will use ContextZoom, augmented
by the “correlation rule” similar to the one in Section 6.1. The resulting “ranked” algorithm will be
called RankContextZoom.

The implementation details are not difficult. Suppose the metric space on documents is the
ε-exponential tree metric, and let $\tau_d$ be the document tree. Consider the (i+1)-th slot, $i \ge 1$.¹¹ Then
the contexts are unordered i-tuples of documents. Let us define context tree $\tau_c$ as follows. Depth-ℓ
nodes of $\tau_c$ are unordered i-tuples of depth-ℓ nodes from $\tau_d$, and leaves are contexts. The root of
$\tau_c$ is $(r \ldots r)$, where $r = \text{root}(\tau_d)$. For each internal node $u_c = (u_1 \ldots u_i)$ of $\tau_c$, its children are all
unordered tuples $(v_1 \ldots v_i)$ such that each $v_j$ is a child of $u_j$ in $\tau_d$. This completes the definition of
$\tau_c$. Letting $u$ and $u_c$ be level-ℓ subtrees of $\tau_d$ and $\tau_c$, respectively, it follows from the definition of $D_c$
in Equation (22) that $D_c(S,S') \le 4i\,\varepsilon^{\ell}$ for any contexts $S,S' \in u_c$. Thus setting $W(u \times u_c) \triangleq \varepsilon^{\ell}(4i+1)$
satisfies Equation (17).

We define the “correlation rule” as follows. Let $(u, u_c)$ be an active strategy in the execution
of ContextZoom, where $u$ is a subtree of the document tree $\tau_d$, and $u_c$ is a subtree of the context
tree $\tau_c$. It follows from the analysis in (Slivkins, 2009) that decreasing the index of $(u, u_c)$ improves
performance, as long as it holds that

    $$\text{index}(u, u_c) \ge \mu(x|Z_S) \quad \forall x \in u,\; S \in u_c.$$

Recall that $\mu(x|Z_S) \le D(x,S)$ by Equation (20), so we can cap $\text{index}(u, u_c)$ at $\max_{x \in u} D(x,S)$:

    $$\text{index}(u, u_c) \leftarrow \min\left( \text{index}(u, u_c),\; \max_{x \in u} D(x,S) \right).$$

This completes the description of RankContextZoom.


7. Provable Scalability Guarantees and Discussion 


Noting that for each slot $i \le k$ the covering dimension of the DC-space is at most k times the
covering dimension of $(X,D)$, it follows that a (very pessimistic) upper bound on contextual regret
of RankContextZoom is $R(T) = O(\alpha\, T^{1 - 1/(kd+2)})$. Plugging this into Equation (13), we obtain:


Theorem 10 Consider the k-slot Lipschitz MAB problem on a metric space with covering
dimension d (as defined in Equation (15), with constant α). Then after T rounds algorithm
RankContextZoom achieves

    $$\frac{\mathbb{E}[\#\text{clicks}]}{T} \ge \left(1 - \tfrac{1}{e}\right) \times \text{OPT} - O\left(\frac{\alpha k}{T^{1/(kd+2)}}\right).$$


This is just a basic scalability guarantee which does not degenerate with the number of
documents. (Note that it is worse than the one for RankGridEXP3.) We believe that this guarantee is
very pessimistic, as it builds on a very pessimistic version of the result for ContextZoom. In
particular, we ignore the intuition that for a given slot, contexts $S \subset X$ may gradually converge over time
to the greedy optimum, which effectively results in a much smaller set of possible contexts.¹² We
believe this effect is very important to the performance of RankContextZoom. In particular, it causes
RankContextZoom to perform much better than RankGridEXP3 in simulations.





11. For slot 1, contexts are empty, so ContextZoom reduces to Algorithm 3. 
12. It is also wasteful (but perhaps less so) that we use a slot-k bound for each slot i < k. 
7.1 A Better Benchmark


Recall that while the bound in Equation (13) uses $(1 - \frac{1}{e})\,\text{OPT}$ as a benchmark, a more natural
benchmark would be the greedy optimum. We provide a preliminary convergence result for
RankContextZoom, without any specific regret bounds.

Such result is more elegantly formulated in terms of a version of RankContextZoom, hence- 
forth called anytime-RankContextZoom, which uses the anytime version of ContextZoom (see 
Section 5.3). 


Theorem 11 Fix an instance of the k-slot MAB problem. The performance of anytime-RankContextZoom
up to any given time t is equal to the greedy optimum minus $f(t)$ such that $f(t) \to 0$.


Proof Sketch It suffices to prove that with high probability, anytime-RankContextZoom outputs a
greedy ranking in all but $f_k(t)$ rounds among the first t rounds, where $f_k(t) \to 0$.

We prove this claim by induction on k, the number of slots. Suppose it holds for some k − 1
slots, and focus on the k-th slot. Consider all rounds in which a greedy ranking is chosen for the
upper slots but not for the k-th slot. In each such round, the k-th slot replica of anytime-ContextZoom
incurs contextual regret at least $\delta_k$, for some instance-specific constant $\delta_k > 0$. Thus, with
high probability there can be at most $R_k(t)/\delta_k$ such rounds, where $R_k(t) = o(t)$ is an upper bound
on contextual regret for slot k. Thus, one can take $f_k(t) = f_{k-1}(t) + R_k(t)/\delta_k$. ∎


Theorem 11 is about the “metric-less” setting from Radlinski et al. (2008). It easily extends 
to the “ranked” version of any bandit algorithm whose contextual regret is sublinear with high 
probability. 

It is an open question whether (and under which assumptions) Theorem 11 can be extended to 
the “ranked” versions of non-contextual bandit algorithms such as RankUCB1. One assumption that 
appears essential is the uniqueness of the greedy ranking. To see that multiple greedy rankings may 
cause problems for ranked non-contextual algorithms, consider a simple example: 


• There are two slots and three documents $x_1, x_2, x_3$ such that $\mu = (\frac{1}{2}, \frac{1}{2}, \frac{1}{3})$ and the relevance of
  each arm is independent of that of the other arms.¹³


An optimal ranking for this example is a greedy ranking that puts $x_1$ and $x_2$ in the two slots, achieving
aggregate click probability $\frac{3}{4}$. According to our intuition, a “reasonable” ranked non-contextual
algorithm will behave as follows. The slot-1 algorithm will alternate between $x_1$ and $x_2$, each
with frequency $\to \frac{1}{2}$. Since the slot-2 algorithm is oblivious to the slot-1 selection, it will observe
averages that converge over time to $(\frac{1}{4}, \frac{1}{4}, \frac{1}{3})$,¹⁴ so it will select document $x_3$ with frequency → 1.
Therefore with frequency → 1 the ranked algorithm will alternate between $(x_1, x_3)$ and $(x_2, x_3)$, each of which
has aggregate click probability $\frac{2}{3}$.
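
The arithmetic in this example can be checked with exact fractions. The sketch below assumes the pointwise means are (1/2, 1/2, 1/3) with independent relevance, and that the slot-1 algorithm shows $x_1$ and $x_2$ equally often; the helper names are hypothetical.

```python
from fractions import Fraction as F

# Assumed pointwise means for the three documents (independent relevance).
mu = {"x1": F(1, 2), "x2": F(1, 2), "x3": F(1, 3)}


def click_prob(ranking):
    """Probability that at least one document in the ranking is relevant."""
    p_none = F(1)
    for doc in set(ranking):
        p_none *= 1 - mu[doc]
    return 1 - p_none


def conditional(doc, shown):
    """mu(doc | Z_S) under independence: zero if doc is itself in S."""
    return F(0) if doc in shown else mu[doc]


# Averages seen by a slot-2 algorithm that ignores which of x1, x2 is in slot 1.
averages = {d: sum(conditional(d, {s}) for s in ("x1", "x2")) / 2 for d in mu}

greedy_value = click_prob(("x1", "x2"))      # ranking chosen by the greedy algorithm
oblivious_value = click_prob(("x1", "x3"))   # ranking the oblivious algorithm settles on
```

Under these assumptions the slot-2 averages favor $x_3$, even though pairing $x_1$ with $x_2$ gives the higher aggregate click probability.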





13. Here documents x;,x2,x3 can stand for disjoint subsets of documents with highly correlated payoffs. Documents 
within a given subset can lie far from one another in the metric space. 

14. Suppose $x_j$, $j \in \{1,2\}$, is chosen in slot 1. Then, letting $S = \{x_j\}$, $\mu(x_1|Z_S)$ equals 0 if j = 1 and $\frac{1}{2}$ otherwise (which
averages to $\frac{1}{4}$), whereas $\mu(x_3|Z_S) = \frac{1}{3}$.
Algorithm          Description                                          Defined in
RankUCB1           metric-oblivious algorithms:                         Section 5.1
RankEXP3           ranked versions of UCB1 and EXP3
RankGridUCB1       simple metric-aware algorithms:                      Section 5.4
RankGridEXP3       ranked versions of GridUCB1 and GridEXP3
RankZoom           the ranked version of the zooming algorithm          Section 5.4
                   contextual algorithms:
RankCorrZoom       “light-weight” (based on the zooming algorithm)      Section 6.1
RankContextZoom    “full-blown” (based on ContextZoom)                  Section 6.3


Table 1: Algorithms for the k-slot Lipschitz MAB problem.


7.2 Desiderata 


We believe that the above guarantees do not reflect the full power of our algorithms, and more
generally the full power of conditional L-continuity. The “ideal” performance guarantee for
RankBandit in our setting would use the greedy optimum as a benchmark, and would have a bound on
regret that is free from the inefficiencies outlined in the discussion after Theorem 10. Furthermore,
this guarantee would only rely on some general property of Bandit such as a bound on regret
or contextual regret. We conjecture that such a guarantee is possible for RankContextZoom, and,
perhaps under some assumptions, also for RankCorrZoom and RankZoom.

Further, one would like to study the relative benefits of the new “contextual” algorithms
(RankContextZoom and RankCorrZoom) and the prior work such as RankZoom. The discussion in
Section 7.1 suggests that the difference can be particularly pronounced when the pointwise mean has
multiple peaks of similar value. In fact, we confirm this experimentally in Section 8.4.


8. Evaluation 


Let us evaluate the performance of the algorithms presented in Section 5 and Section 6. We
summarize these algorithms in Table 1.

In all UCB1-based algorithms in Table 1, including all extensions of the zooming algorithm,
one can damp exploration by replacing the $4 \log(T)$ factor in Equation (14) with 1. Such a change
effectively makes the algorithm more optimistic; it was found beneficial for RankUCB1 by Radlinski
et al. (2008). We find (see Section 8.3) that this change greatly improves the average performance
in our experiments. So, by a slight abuse of notation, we will assume this change from now on.


8.1 Experimental Setup 


Using the generative model from Section 4 (Algorithm 1 with Equation (8)), we created a document
collection with $|X| = 2^{15} \approx 32{,}000$ documents¹⁵ in a binary ε-exponential tree metric space with
ε = 0.837 (and constant c = 1, see Section 3.1). The value for ε was chosen so that the most
dissimilar documents in the collection still have a non-trivial similarity, as may be expected for web
documents. Each document’s expected relevance $\mu(x)$ was set by first identifying a small number





15. This is a realistic number of documents that may be considered in detail for a typical web search query after pruning 
very unlikely documents. 
of “peaks” $y_i \in X$, choosing $\mu(\cdot)$ for these documents, and then defining the relevance of other
documents as the minimum allowed while obeying L-continuity and a background relevance rate
$\mu_0$:

    $$\mu(x) \triangleq \max\left( \mu_0,\; \tfrac{1}{2} - \min_i D(x, y_i) \right).$$    (26)

For internal nodes in the tree, $\mu$ is defined bottom-up (from leaves to the root) as the mean value of
all children nodes. As a result, we obtain a set of documents X where each document $x \in X$ has an
expected click probability $\mu(x)$ that obeys L-continuity.
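
The construction of the leaf relevances can be sketched as follows. The assumptions are labeled in the code: leaves of a complete binary tree are numbered 0..2^depth − 1, the distance between two distinct leaves is ε raised to the depth of their least common ancestor, and `peak_value` is a hypothetical parameter for the relevance assigned at the peaks (1/2 by default).

```python
def tree_dist(x, y, depth, eps):
    """Assumed eps-exponential tree metric between leaves x and y."""
    if x == y:
        return 0.0
    d = depth
    while x != y:          # climb until the two leaves meet at their LCA
        x >>= 1
        y >>= 1
        d -= 1
    return eps ** d


def relevance(x, peaks, depth, eps, mu0, peak_value=0.5):
    """Leaf relevance: the background rate mu0, raised near the peaks."""
    nearest = min(tree_dist(x, y, depth, eps) for y in peaks)
    return max(mu0, peak_value - nearest)


def is_l_continuous(mu, depth, eps, tol=1e-12):
    """Check |mu(x) - mu(y)| <= D(x, y) for all pairs of leaves."""
    n = len(mu)
    return all(abs(mu[x] - mu[y]) <= tree_dist(x, y, depth, eps) + tol
               for x in range(n) for y in range(n))
```

For a depth-3 tree with ε = 0.5, a single peak at leaf 0 and μ₀ = 0.05, the sibling of the peak gets relevance 0.25 and every other leaf falls back to the background rate.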

Our simulation was run over a 5-slot ranked bandit setting, learning the best 5 documents. We 
evaluated over 300,000 user visits sampled from P per Algorithm 1. Performance within 50,000 
impressions, typical for the number of times relatively frequent queries are seen by commercial 
search engines in a month, is essential for any practical applicability of this approach. However, 
we also measure performance for a longer time period to obtain a deeper understanding of the 
convergence properties of the algorithms. 

We consider two models for $\mu(\cdot)$ in Equation (26). In the first model, two “peaks”
$\{y_1, y_2\}$ are selected at random with $\mu(\cdot) = \tfrac{1}{2}$, and $\mu_0$ set to 0.05.
The second model is less “rigid” (and thus more realistic): the relevant documents $y_i$ and their
expected relevance rates $\mu(\cdot)$ are selected according to a Chinese Restaurant Process
(Aldous, 1985) with parameters $n = 20$ and $\theta = 2$, and setting $\mu_0 = 0.01$. The Chinese
Restaurant Process is inspired by customers coming in to a restaurant with an infinite number of
tables, each with infinite capacity. At time $t$, a customer arrives and can choose to sit at a new
table with probability $\theta/(t - 1 + \theta)$, and otherwise sits at an already occupied table
with probability proportional to the number of customers already sitting at that table. By
considering each table as equivalent to a peak in the distribution, this leads to a set of peaks
with expected relevance rates distributed according to a power law. Following Radlinski et al.
(2008), we assign users to one of the peaks, then select relevant documents so as to obey the
expected relevance rate $\mu(x)$ for each document $x$.
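The seating rule of the Chinese Restaurant Process described above can be sketched as follows. The function name is hypothetical, and how table sizes are converted into peak relevance rates is not shown, since that mapping is not spelled out in this section.

```python
import random


def chinese_restaurant_process(n: int, theta: float, rng: random.Random) -> list:
    """Seat n customers and return the number of customers at each table."""
    tables = []
    for t in range(1, n + 1):
        # Customer t opens a new table with probability theta / (t - 1 + theta).
        # For t = 1 this probability is 1, so the first customer always does.
        if rng.random() < theta / (t - 1 + theta):
            tables.append(1)
        else:
            # Otherwise join an occupied table, chosen with probability
            # proportional to the number of customers already seated there.
            chosen = rng.choices(range(len(tables)), weights=tables)[0]
            tables[chosen] += 1
    return tables
```

For example, `chinese_restaurant_process(20, 2.0, random.Random(0))` seats $n = 20$ customers with $\theta = 2$; each resulting table plays the role of one peak.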

As baselines we use an algorithm ranking the documents at random, and the (offline) greedy 
algorithm discussed in Section 5.1. 


8.2 Main Experimental Results 


Our experimental results are summarized in Figure 1 and Figure 2. 

RankEXP3 and RankUCB1 perform as poorly as picking documents randomly: the three curves 
are indistinguishable. This is due to the large number of available documents and slow convergence 
rates of these algorithms. Other algorithms that explore all strategies (such as REC; Radlinski et al.,
2008) would perform just as poorly. This result is consistent with results reported by Radlinski et al.
(2008) on just 50 documents. On the other hand, algorithms that progressively refine the space of 
strategies explored perform much better. 

RankCorrZoom achieves the best empirical performance, converging rapidly to near-optimal 
rankings. RankZoom is a close second. The theoretically preferred RankContextZoom comes third, 
with a significant gap. This appears to be due to the much larger branching factor in the strategies 
activated by RankContextZoom slowing down the convergence. (However, as we investigate in 
Section 8.4, RankContextZoom may significantly outperform the other algorithms if u has multiple 
peaks with similar values.) 
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Figure 1: The learning algorithms on 5-slot problem instances with two relevance peaks. 
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Figure 2: The learning algorithms on 5-slot problem instances with random relevance rates u(-) 
selected according to the Chinese Restaurant Process. 
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8.3 “Optimistic” vs. “Pessimistic” UCB1-style Algorithms 


We find that the “optimistic” UCB1-style algorithms (obtained by replacing the $4\log(T)$ factor
in Equation (14) with 1) perform dramatically better than their “pessimistic” counterparts. In
Figure 3 and Figure 4 we compare RankUCB1 and RankZoom with their respective “pessimistic”
versions (which are marked with a “--” after the algorithm name). We saw a similar increase in
performance for other UCB1-style algorithms, too.


8.4 Secondary Experiment 


As discussed in Section 7.1, some RankBandit-style algorithms may converge to a suboptimal
ranking if $\mu$ has multiple peaks with similar values. To investigate this, we designed a
small-scale experiment presented in Figure 5. We generated a small collection of 128 documents
using the same setup with two “peaks”, and assumed 2 slots. Each peak corresponds to a half of the
user population, with peak value $\mu = \tfrac{1}{2}$ and background value $\mu_0 = 0.05$.

We see that RankContextZoom converges more slowly than the other zooming variants, but
eventually outperforms them. This confirms our intuition, and suggests that RankContextZoom
may eventually outperform the other algorithms on a larger collection, such as that used for
Figures 1 and 2.


9. Further Directions 


This paper initiates the study of bandit learning-to-rank with side information on similarity between 
documents, focusing on an idealized model of document similarity based on the new notion of 
“conditional Lipschitz-continuity”. As discussed in Section 7, we conjecture that provable
performance guarantees can be improved significantly. On the experimental side, future work will include
evaluating the model on web search data, and designing sufficiently memory- and time-efficient 
implementations to allow experiments on real users. An interesting challenge in such an endeavor 
would be to come up with effective similarity measures. A natural next step would be to also exploit 
the similarity between search queries. 


Appendix A. Proof of Lemma 4 (Extending u from Leaves to Tree Nodes) 


Recall that Lemma 4 is needed to define the generative model in Section 4. We will prove a slightly 
more general statement: 


Lemma 12 Let $D$ be the shortest-paths metric of an edge-weighted rooted tree with node set $V$ and
leaf set $X$. Let $\mu : X \to [a,b]$ be an L-continuous function on $(X, D)$. Then $\mu$ can be
extended to $V$ so that $\mu : V \to [a,b]$ is L-continuous w.r.t. $(V, D)$.


Proof For each $x \in V$, let $L(x)$ be the set of all leaves in the subtree rooted at $x$. For each
$z \in L(x)$ the assignment $\mu(x)$ should satisfy

$$\mu(z) - D(x,z) \le \mu(x) \le \mu(z) + D(x,z).$$
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Figure 3: “Optimistic” vs. “pessimistic” UCB1-style algorithms: 
The learning algorithms on 5-slot problem instances with two relevance peaks. 
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Figure 4: “Optimistic” vs. “pessimistic” UCB1-style algorithms: 
The learning algorithms on 5-slot problem instances with random relevance rates u(-) 
selected according to the Chinese Restaurant Process. 


Figure 5: Zooming-style algorithms in a two-slot setting over a small document collection.
(Plot not reproduced. Vertical axis: mean reward per recent time step. Horizontal axis:
presentation time steps, in thousands. Curves: Greedy Optimum, RankZoom+, RankCorrZoom+,
RankContextZoom+.)


Thus $\mu(x)$ should lie in the interval $I(x) \triangleq [\mu^-(x), \mu^+(x)]$, where

$$\mu^-(x) \triangleq \sup_{z \in L(x)} \mu(z) - D(x,z), \qquad
\mu^+(x) \triangleq \inf_{z \in L(x)} \mu(z) + D(x,z).$$

This interval is always well-defined, that is, $\mu^-(x) \le \mu^+(x)$. Indeed, if not then for
some $z, z' \in L(x)$

$$\mu(z) - D(x,z) > \mu(z') + D(x,z'),$$
$$\mu(z) - \mu(z') > D(x,z) + D(x,z') \ge D(z,z'),$$

contradiction, claim proved. Note that $\mu^+(x) \ge a$ and $\mu^-(x) \le b$, so the intervals
$I(x)$ and $[a,b]$ overlap.

Using induction on the tree, we will construct values $\mu(x)$, $x \in V$ such that the Lipschitz
condition

$$|\mu(x) - \mu(y)| \le D(x,y), \qquad x, y \in V,$$

holds whenever $x$ is a parent of $y$. For the root $x_0$, let $\mu(x_0)$ be an arbitrary value in
the interval $I(x_0) \cap [a,b]$. For the induction step, suppose for some $x$ we have chosen
$\mu(x) \in I(x) \cap [a,b]$ and $y$ is a child of $x$. We need to choose
$\mu(y) \in I(y) \cap [a,b]$ so that $|\mu(x) - \mu(y)| \le D(x,y)$. Note that

$$\begin{aligned}
\mu(x) \ge \mu^-(x) &\ge \sup_{z \in L(y)} \left[\mu(z) - D(x,y) - D(y,z)\right] = \mu^-(y) - D(x,y), \\
\mu(x) \le \mu^+(x) &\le \inf_{z \in L(y)} \left[\mu(z) + D(x,y) + D(y,z)\right] = \mu^+(y) + D(x,y).
\end{aligned}$$

It follows that $I(y)$ and $[\mu(x) - D(x,y),\ \mu(x) + D(x,y)]$ have a non-empty intersection.
Moreover, both intervals have a non-empty intersection with $[a,b]$: the interval $I(y)$ by the
note above, and the second interval because it contains $\mu(x) \in [a,b]$. Since intervals on the
real line that intersect pairwise share a common point, we can choose $\mu(y)$ as required. This
completes the construction of $\mu(\cdot)$ on $V$.
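The construction above can be sketched as two passes over the tree: a bottom-up pass computing the intervals $I(x)$, and a top-down pass choosing a value in each. This is a minimal illustration; the dictionary-based tree representation, the function name `extend_to_tree`, and the choice of the midpoint (any point of the intersection would do) are assumptions made for the example.

```python
def extend_to_tree(children, edge_len, leaf_value, root, a, b):
    """Extend leaf values to all tree nodes, following the proof of Lemma 12.

    children:   dict mapping each internal node to the list of its children
    edge_len:   dict mapping (parent, child) to the edge weight D(parent, child)
    leaf_value: dict mapping each leaf to its value in [a, b]
    """
    lower, upper = {}, {}

    def compute_interval(x):
        kids = children.get(x, [])
        if not kids:  # leaf: I(x) is the single point mu(x)
            lower[x] = upper[x] = leaf_value[x]
            return
        for y in kids:
            compute_interval(y)
        # mu^-(x) = max_y mu^-(y) - D(x,y);  mu^+(x) = min_y mu^+(y) + D(x,y)
        lower[x] = max(lower[y] - edge_len[(x, y)] for y in kids)
        upper[x] = min(upper[y] + edge_len[(x, y)] for y in kids)

    mu = {}

    def assign(x, lo, hi):
        # Intersect I(x), [a, b] and the window allowed by the parent's value.
        lo = max(lo, lower[x], a)
        hi = min(hi, upper[x], b)
        mu[x] = (lo + hi) / 2  # illustrative choice within the intersection
        for y in children.get(x, []):
            d = edge_len[(x, y)]
            assign(y, mu[x] - d, mu[x] + d)

    compute_interval(root)
    assign(root, a, b)
    return mu
```

The bottom-up recursion uses the fact that, for a leaf $z$ below a child $y$ of $x$, the tree distance satisfies $D(x,z) = D(x,y) + D(y,z)$.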
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To check that u is Lipschitz-continuous on V, fix x,y € V, let P be the x — y path in the tree, 
and note that 


$$|\mu(x) - \mu(y)| \le \sum_{(u,v) \in P} |\mu(u) - \mu(v)| \le \sum_{(u,v) \in P} D(u,v) = D(x,y).$$


Appendix B. Proof of Theorem 5 (Expressiveness of the Model) 


Recall that a proof sketch for Theorem 5 was given in Section 4. In this section we complete this 
proof sketch by proving Equation (12). 

Notation. Let us introduce the notation (some of it is from the proof sketch). 

For a tree node $u$, let $T_u$ be the node set of the subtree rooted at $u$. For convenience (and by
a slight abuse of notation) we will write $u = b$, $b \in \{0,1\}$ to mean $\pi(u) = b$.

Fix documents $x, y \in X$. We focus on the key event, denoted $\mathcal{E}$, that no mutation
happened on the $x \to y$ path. Recall that in Algorithm 1, for each tree node $u$ with parent $v$
we assign $\pi(u) \leftarrow M_u(\pi(v))$, where $M_u : \{0,1\} \to \{0,1\}$ is a random mutation
which flips the input bit $b$ with probability $q_b(u)$. If $M_u$ is the identity function, then we
say that no mutation happened at $u$. We say that no mutation happened on the $x \to y$ path if no
mutation happened at each node in $N_{xy}$, the set of all nodes on the $x \to y$ path except $z$.
This event is denoted $\mathcal{E}$; note that it implies $\pi(x) = \pi(y) = \pi(z)$. Its
complement $\bar{\mathcal{E}}$ is, intuitively, a low-probability “failure event”.

Fix a subset of documents $S \subset X$. Recall that $Z_S$ denotes the event that all documents in
$S$ are irrelevant, that is, $\pi(x) = 0$ for all $x \in S$.

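What we need to prove. We need to prove Equation (12), which states that

$$\Pr[\bar{\mathcal{E}} \mid Z_S] \le 3 \Pr[\bar{\mathcal{E}}].$$

It suffices to prove the following lemma:

Lemma 13 $\Pr[\bar{\mathcal{E}} \mid Z_S] \le \Pr[\bar{\mathcal{E}}] \times (2 / \Pr[\mathcal{E}])$.

(Indeed, letting $p = \Pr[\bar{\mathcal{E}}]$ it holds that
$\Pr[\bar{\mathcal{E}} \mid Z_S] \le \min\left(1, \tfrac{2p}{1-p}\right) \le 3p$.)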
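For completeness, the last inequality in the parenthetical remark can be checked by a short case analysis on $p$ (a worked derivation added for the reader; the threshold $\tfrac{1}{3}$ is simply where the two terms of the minimum cross $3p$):

```latex
\[
p \le \tfrac{1}{3}
\;\Longrightarrow\; 1 - p \ge \tfrac{2}{3}
\;\Longrightarrow\; \frac{2p}{1-p} \le \frac{2p}{2/3} = 3p,
\qquad\qquad
p > \tfrac{1}{3}
\;\Longrightarrow\; 1 < 3p .
\]
\[
\text{Hence } \min\!\left(1,\ \frac{2p}{1-p}\right) \le 3p
\quad \text{for every } p \in [0,1).
\]
```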


Remark. Lemma 13 inherits assumptions (6-7) on the mutation probabilities. Specifically for this
lemma, the upper bound (6) on mutation probabilities can be replaced with a much weaker upper
bound:

$$\max(q_0(u), q_1(u)) \le \tfrac{1}{2} \quad \text{for each tree node } u. \tag{27}$$

Our goal is to prove Lemma 13. In a sequence of claims, we will establish that

$$\Pr[Z_S \mid z = 0] \ge \Pr[Z_S \mid z = 1]. \tag{28}$$

Intuitively, (28) means that the low-probability mutations are more likely to zero out a given subset
of the leaves if the value at some fixed internal node is zero (rather than one).
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B.1 Using Equation (28) to Prove Lemma 13 


Let us extend the notion of mutation from a single node to the $x \to y$ path. Recall that $N_{xy}$
denotes the set of all nodes on this path except $z$. Then the individual node mutations
$\{M_u : u \in N_{xy}\}$ collectively provide a mutation on $N_{xy}$, which we define simply as a
function $M : N_{xy} \times \{0,1\} \to \{0,1\}$ such that $\pi(\cdot) = M(\cdot, \pi(z))$.
Crucially, $M$ is chosen independently of $\pi(z)$ (and of all other mutations). Let $\mathcal{M}$
be the set of all possible mutations of $N_{xy}$. By a slight abuse of notation, we treat the event
$\mathcal{E}$ as the identity mutation.


Claim 14 Fix $M \in \mathcal{M}$ and $b \in \{0,1\}$. Then

$$\Pr[Z_S \mid M, \pi(z) = b] \le \Pr[Z_S \mid \mathcal{E}, \pi(z) = 0].$$


Proof For each tree node $u$, let $S_u = S \cap T_u$ be the subset of $S$ that lies in the subtree
$T_u$. Then by (28)

$$\begin{aligned}
\Pr[Z_S \mid M, \pi(z) = b] &= \prod_u \Pr[Z_{S_u} \mid \pi(u) = M(u,b)] \\
&\le \prod_u \Pr[Z_{S_u} \mid \pi(u) = 0] \\
&= \Pr[Z_S \mid \mathcal{E}, \pi(z) = 0],
\end{aligned}$$

where the product is over all tree nodes $u \in N_{xy}$ such that the intersection $S_u$ is
non-empty. $\square$


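Proof [Proof of Lemma 13] On one hand, by Claim 14

$$\begin{aligned}
\Pr[Z_S \cap \bar{\mathcal{E}}] &= \sum_{b,M} \Pr[M] \Pr[z = b] \Pr[Z_S \mid M, z = b] \\
&\le \sum_{b,M} \Pr[M] \Pr[z = b] \Pr[Z_S \mid \mathcal{E}, z = 0] \\
&= \Pr[\bar{\mathcal{E}}] \times \Pr[Z_S \mid \mathcal{E}, z = 0],
\end{aligned}$$

where the sums are over bits $b \in \{0,1\}$ and all mutations
$M \in \mathcal{M} \setminus \{\mathcal{E}\}$. On the other hand,

$$\begin{aligned}
\Pr[Z_S] &= \sum_{b,M} \Pr[M] \Pr[z = b] \Pr[Z_S \mid M, z = b]
\quad \text{(where the sum is over } b \in \{0,1\} \text{ and } M \in \mathcal{M}\text{)} \\
&\ge \Pr[\mathcal{E}] \Pr[z = 0] \Pr[Z_S \mid \mathcal{E}, z = 0].
\end{aligned}$$

Since $\Pr[z = 0] \ge \tfrac{1}{2}$, it follows that

$$\Pr[\bar{\mathcal{E}} \mid Z_S] = \Pr[Z_S \cap \bar{\mathcal{E}}] / \Pr[Z_S]
\le 2 \Pr[\bar{\mathcal{E}}] / \Pr[\mathcal{E}]. \qquad \square$$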
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B.2 Proof of Equation (28) 


First we prove (28) for the case $S \subset T_z$, then we build on it to prove the (similar, but
considerably more technical) case $S \cap T_z = \emptyset$. The general case follows since the
events $Z_{S \cap T_z}$ and $Z_{S \setminus T_z}$ are conditionally independent given $\pi(z)$.


Claim 15 If $S \subset T_z$ then (28) holds.


Proof Let us use induction on the depth of $z$. For the base case, consider the case $x = y = z$.
Then $S = \{z\}$ is the only possibility, and the claim is trivial.

For the induction step, consider children $u_i$ of $z$ such that the intersection
$S_i \triangleq S \cap T_{u_i}$ is non-empty. Let $u_1, \ldots, u_k$ be all such children. For
brevity, denote $Z_i \triangleq Z_{S_i}$, and

$$\nu_i(a|b) = \Pr[u_i = a \mid z = b], \quad a, b \in \{0,1\}.$$

Note that $\nu_i(1|0) = q_0(u_i)$ and $\nu_i(0|1) = q_1(u_i)$.
Then for each $b \in \{0,1\}$ we have

$$\Pr[Z_S \mid z = b] = \prod_i \Pr[Z_i \mid z = b] \tag{29}$$
$$\Pr[Z_i \mid z = b] = \sum_{a \in \{0,1\}} \nu_i(a|b) \Pr[Z_i \mid u_i = a]. \tag{30}$$

By (29), to prove the claim it suffices to show that
$\Pr[Z_i \mid z = 0] \ge \Pr[Z_i \mid z = 1]$
holds for each $i$. By the induction hypothesis we have

$$\Pr[Z_i \mid u_i = 0] \ge \Pr[Z_i \mid u_i = 1]. \tag{31}$$

Combining (31) and (30), and noting that by (27) we have $\nu_i(0|0) \ge \nu_i(0|1)$, it follows that

$$\begin{aligned}
\Pr[Z_i \mid z = 0] - \Pr[Z_i \mid z = 1]
&= \sum_{a \in \{0,1\}} \Pr[Z_i \mid u_i = a] \left( \nu_i(a|0) - \nu_i(a|1) \right) \\
&\ge \Pr[Z_i \mid u_i = 1] \sum_{a \in \{0,1\}} \left( \nu_i(a|0) - \nu_i(a|1) \right) \\
&= 0
\end{aligned}$$

because $\nu_i(0|0) + \nu_i(1|0) = \nu_i(0|1) + \nu_i(1|1) = 1$. $\square$


Corollary 16 Consider tree nodes $r, u, w$ such that $r$ is an ancestor of $u$ which in turn is an
ancestor of $w$. Then for any $c \in \{0,1\}$

$$\Pr[u = 0 \mid w = 0, r = c] \ge \Pr[u = 0 \mid w = 1, r = c].$$

Proof We claim that for each $b \in \{0,1\}$

$$\Pr[w = b \mid u = b] \ge \Pr[w = b \mid u = 1 - b]. \tag{32}$$

Indeed, truncating the subtree $T_w$ to a single node $w$ and specializing Claim 15 to a singleton
set $S = \{w\}$ (with $z = u$) we obtain (32) for $b = 0$. The case $b = 1$ is symmetric.

Now, for brevity we will omit conditioning on $\{r = c\}$ in the remainder of the proof. (Formally,
we will work in the probability space obtained by conditioning on this event.) Then for each
$b \in \{0,1\}$

$$\Pr[u = 0 \mid w = b]
= \frac{\Pr[u = 0 \wedge w = b]}{\Pr[u = 0 \wedge w = b] + \Pr[u = 1 \wedge w = b]}
= \frac{1}{1 + \Phi(b)},$$

where

$$\Phi(b) \triangleq \frac{\Pr[u = 1 \wedge w = b]}{\Pr[u = 0 \wedge w = b]}
= \frac{\Pr[w = b \mid u = 1] \Pr[u = 1]}{\Pr[w = b \mid u = 0] \Pr[u = 0]}$$

is non-decreasing in $b$ by (32), so that $\Pr[u = 0 \mid w = b]$ is non-increasing in $b$. $\square$


We will also need a stronger, conditional, version of Claim 15 whose proof is essentially
identical (and omitted).


Claim 17 Suppose $S \subset T_z$ and $u \ne z$ is a tree node such that $T_u$ is disjoint with $S$.
Then

$$\Pr[Z_S \mid z = 0, u = 1] \ge \Pr[Z_S \mid z = 1, u = 1].$$

We will use Corollary 16 and Claim 17 to prove (28) for the case $S \cap T_z = \emptyset$.

Claim 18 If $S$ is disjoint with $T_z$ then (28) holds.


Proof Suppose $S$ is disjoint with $T_z$, and let $r$ be the root of the tree. We will use induction
on the tree to prove the following: for each $c \in \{0,1\}$,

$$\Pr[Z_S \mid r = c, z = 0] \ge \Pr[Z_S \mid r = c, z = 1]. \tag{33}$$

For the induction base, consider a tree of depth 2, consisting of the root $r$ and the leaves. Then
$z \notin S$ is a leaf, so $Z_S$ is independent of $\pi(z)$ given $\pi(r)$, so (33) holds with
equality.

For the induction step, fix $c \in \{0,1\}$. Let us set up the notation similarly to the proof of
Claim 15. Consider children $u_i$ of $r$ such that the intersection $S_i = S \cap T_{u_i}$ is
non-empty. Let $u_1, \ldots, u_k$ be all such children. Assume $z \in T_{u_i}$ for some $i$ (else,
$Z_S$ is independent from $\pi(z)$ given $\pi(r)$, so (33) holds with equality); without loss of
generality, assume this happens for $i = 1$. For brevity, for $a, b \in \{0,1\}$ denote

$$f_i(a,b) \triangleq \Pr[Z_{S_i} \mid u_i = a, z = b],$$
$$\nu_i(a|b) \triangleq \Pr[u_i = a \mid r = c, z = b].$$

Note that $f_i(a,b)$ and $\nu_i(a|b)$ do not depend on $b$ for $i > 1$.
Then for each $b \in \{0,1\}$

$$\begin{aligned}
\Pr[Z_S \mid r = c, z = b]
&= \sum_{a_i \in \{0,1\},\ i \ge 1} \ \prod_{i \ge 1} f_i(a_i, b)\, \nu_i(a_i|b) \\
&= \Phi \times \sum_{a \in \{0,1\}} f_1(a,b)\, \nu_1(a|b),
\end{aligned}$$

where

$$\Phi \triangleq \sum_{a_i \in \{0,1\},\ i \ge 2} \ \prod_{i \ge 2} f_i(a_i, b)\, \nu_i(a_i|b)$$

does not depend on $b$. Therefore:

$$\begin{aligned}
\Pr[Z_S \mid r = c, z = 0] - \Pr[Z_S \mid r = c, z = 1]
&= \Phi \times \sum_{a \in \{0,1\}} \left[ f_1(a,0)\, \nu_1(a|0) - f_1(a,1)\, \nu_1(a|1) \right] && (34) \\
&\ge \Phi \times \sum_{a \in \{0,1\}} f_1(a,1) \left[ \nu_1(a|0) - \nu_1(a|1) \right] && (35) \\
&\ge \Phi \times f_1(1,1) \sum_{a \in \{0,1\}} \left[ \nu_1(a|0) - \nu_1(a|1) \right] && (36) \\
&= 0. && (37)
\end{aligned}$$

The above transitions hold for the following reasons:
(34 → 35) By the induction hypothesis, $f_1(a,0) \ge f_1(a,1)$.
(35 → 36) By Claim 17, $f_1(0,1) \ge f_1(1,1)$, and moreover we have $\nu_1(0|0) \ge \nu_1(0|1)$ by
Corollary 16.
(36 → 37) Since $\nu_1(0|0) + \nu_1(1|0) = \nu_1(0|1) + \nu_1(1|1) = 1$.
This completes the proof of the inductive step. $\square$


Appendix C. Instance-Dependent Regret Bounds from Prior Work 


In this section we discuss instance-dependent regret bounds from prior work on UCB1-style
algorithms for the single-slot setting. The purpose is to put forward concrete mathematical evidence
which suggests that RankGridUCB1, RankZoom and RankCorrZoom are likely to satisfy strong upper
bounds on regret in the k-slot setting (perhaps under some additional assumptions), even if such
bounds are beyond the reach of our current techniques. Similarly, we believe that the regret bound
for RankContextZoom that we have been able to prove (Theorem 10) is overly pessimistic. A
secondary purpose is to provide more intuition for when these algorithms are likely to excel.

Our story begins with the comparison between the guarantees for EXP3 and UCB1 in the standard
(single-slot, metric-free) bandit setting, and then progresses to Lipschitz MAB and contextual
Lipschitz MAB.

In what follows, we let $\mu$ denote the vector of expected rewards in the stochastic reward
setting, so that $\mu(x)$ is the expected reward of arm $x$. Let
$\Delta(x) \triangleq \max \mu(\cdot) - \mu(x)$ denote the “badness” of arm $x$ compared to the
optimum.
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C.1 Standard Bandits: UCB1 vs. EXP3 


Algorithm EXP3 (Auer et al., 2002b) achieves regret $R(T) = O(\sqrt{nT})$ against an oblivious
adversary. In the stochastic setting, UCB1 (Auer et al., 2002a) performs much better, with
logarithmic regret for every fixed $\mu$. More specifically, each arm $x \in X$ contributes only
$O(\log T)/\Delta(x)$ to regret. Noting that the total regret from playing arms with
$\Delta(\cdot) \le \delta$ can be a priori upper-bounded by $\delta T$, we bound regret of UCB1 as:

$$R(T) = \min_{\delta > 0} \left( \delta T + \sum_{x \in X:\ \Delta(x) > \delta} \frac{O(\log T)}{\Delta(x)} \right). \tag{38}$$

Note that Equation (38) depends on $\mu$. In particular, if $\Delta(\cdot) \ge \delta$ then
$R(T) = O(\tfrac{n}{\delta} \log T)$.

However, for any given $T$ there exists a “worst-case” pointwise mean $\mu_T$ such that
$R(T) = \Theta(\sqrt{nT})$ in Equation (38), matching EXP3. The above regret guarantees for EXP3 and
UCB1 are optimal up to constant factors (Auer et al., 2002b; Kleinberg et al., 2008a).
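To see how Equation (38) behaves on a concrete instance, the following sketch evaluates its right-hand side numerically. The constant hidden in $O(\log T)$ is unknown, so the parameter `constant` (default 1) is purely illustrative, as is the function name. The minimization uses the fact that the expression is increasing in $\delta$ between consecutive gap values, so it suffices to check the limit $\delta \to 0$ and each distinct positive gap.

```python
import math


def ucb1_regret_bound(gaps, horizon, constant=1.0):
    """Evaluate min over delta of: delta*T + sum_{x: gap(x) > delta} C*log(T)/gap(x).

    gaps: the values Delta(x) for all arms (the best arm has gap 0).
    constant: ASSUMED stand-in for the constant hidden in O(log T).
    """
    log_t = math.log(horizon)

    def value(delta):
        return delta * horizon + sum(constant * log_t / g for g in gaps if g > delta)

    # Candidate minimizers: the delta -> 0 limit and every positive gap value.
    candidates = {0.0} | {g for g in gaps if g > 0}
    return min(value(delta) for delta in candidates)
```

When all gaps are large, the minimum is attained as $\delta \to 0$ and the bound is logarithmic in $T$; when some gaps are tiny, it is cheaper to pay $\delta T$ for those arms instead.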


C.2 Bandits in Metric Spaces 


Let $(X, D)$ denote the metric space. Recall that the covering number $N_r(X)$ is the minimal number
of balls of radius $r$ sufficient to cover $X$, and the covering dimension is defined as

$$\mathrm{CovDim}(X, D) \triangleq \inf\{d \ge 0 : N_r(X) \le \alpha\, r^{-d} \ \ \forall r > 0\}.$$

(Here $\alpha > 0$ is a constant which we will keep implicit in the notation.)
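For a finite set of points, the covering number can be bounded from above by a simple greedy procedure, sketched below. Note that this is not an exact computation of $N_r(X)$: the greedy choice only gives an upper bound, with ball centers restricted to points of the set. The function name is hypothetical.

```python
def greedy_cover_size(points, dist, radius):
    """Upper bound on the covering number N_r of a finite point set.

    Repeatedly picks an uncovered point as a ball center and removes every
    point within distance `radius` of it (closed balls).
    """
    uncovered = list(points)
    centers = 0
    while uncovered:
        center = uncovered[0]
        centers += 1
        uncovered = [p for p in uncovered if dist(center, p) > radius]
    return centers
```

For example, on the points $0, 1, \ldots, 7$ of the real line with radius 1, the greedy procedure uses 4 balls (centered at 0, 2, 4, 6), whereas 3 balls (centered at 1, 4, 7) suffice.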
Against an oblivious adversary, GridEXP3 has regret

$$R(T) = O\left(\alpha\, T^{(d+1)/(d+2)}\right) \tag{39}$$

where $d$ is the covering dimension of $(X, D)$.

For the stochastic setting, GridUCB1 and the zooming algorithm have better $\mu$-specific regret
guarantees in terms of the covering numbers. These guarantees are similar to Equation (38) for
UCB1. In fact, it is possible, and instructive, to state the guarantees for all three algorithms in a
common form.

Consider reward scales $S = \{2^{-i} : i \in \mathbb{N}\}$, and for each scale $r \in S$ define

$$X_r = \{x \in X : r \le \Delta(x) < 2r\}.$$

Then regret (38) of UCB1 can be restated as

$$R(T) = \min_{\delta > 0} \left( \delta T + \sum_{r \in S:\ r \ge \delta} N_{(\delta,r)}\, \frac{O(\log T)}{r} \right), \tag{40}$$

where $N_{(\delta,r)} = |X_r|$. Further, it follows from the analysis in (Kleinberg, 2004;
Kleinberg et al., 2008b) that regret of GridUCB1 is Equation (40) with
$N_{(\delta,r)} = N_\delta(X_r)$. For the zooming algorithm, the $\mu$-specific bound can be improved
to Equation (40) with $N_{(\delta,r)} = N_r(X_r)$. These results are summarized in Table 2.

For the worst-case $\mu$ one could have $N_\delta(X_r) = N_\delta(X)$, in which case the
$\mu$-specific bound for GridUCB1 essentially reduces to Equation (39).
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algorithm | regret is (40) with ... 
UCB1 | $N_{(\delta,r)} = |X_r|$
GridUCB1 | $N_{(\delta,r)} = N_\delta(X_r)$
zooming algorithm | $N_{(\delta,r)} = N_r(X_r)$
ContextZoom | $N_{(\delta,r)} = N_r(X_{\mathrm{DC},r})$





Table 2: Regret bounds in terms of covering numbers 


For the zooming algorithm, the $\mu$-specific bound above implies an improved version of
Equation (39) with a different, smaller $d$ called the zooming dimension:

$$\mathrm{ZoomDim}(X, D, \mu) \triangleq \inf\{d \ge 0 : N_r(X_r) \le c\, r^{-d} \ \ \forall r > 0\}.$$

Note that the zooming dimension depends on the triple $(X, D, \mu)$ rather than on the metric space
alone. It can be as high as the covering dimension for the worst-case $\mu$, but can be much smaller
(e.g., $d = 0$) for “nice” problem instances, see (Kleinberg et al., 2008b) for further discussion.
For a simple example, suppose an $\epsilon$-exponential tree metric has a “high-reward” branch and a
“low-reward” branch with respective branching factors $b < b'$. Then the zooming dimension is
$\log_{1/\epsilon}(b)$, whereas the covering dimension is $\log_{1/\epsilon}(b')$.
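The two dimensions in this example are easy to evaluate numerically. The sketch below only computes $\log_{1/\epsilon}(b)$; the function name is hypothetical. As a rough justification (assuming that subtrees rooted at depth $i$ have diameter on the order of $\epsilon^i$), a branch with branching factor $b$ has about $b^i$ such subtrees, so covering it with balls of radius $r \approx \epsilon^i$ takes about $b^i = r^{-\log_{1/\epsilon}(b)}$ balls.

```python
import math


def exponential_tree_dimension(branching: float, eps: float) -> float:
    """log_{1/eps}(branching): the dimension of a branch of an eps-exponential
    tree metric with the given branching factor."""
    return math.log(branching) / math.log(1.0 / eps)
```

With $\epsilon = 1/2$, for instance, a branch with branching factor 2 has dimension 1, while a branch with branching factor 4 has dimension 2, so an instance whose high-reward region lies in the former branch has a zooming dimension half as large as its covering dimension.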


C.3 Contextual Bandits in Metric Spaces 


Let $\mu(x|h)$ denote the expected reward from arm $x$ given context $h$. Recall that the algorithm
is given metrics $D$ and $D_{\mathrm{C}}$ on documents and contexts, respectively, such that for any
two documents $x, x'$ and any two contexts $h, h'$ we have

$$|\mu(x|h) - \mu(x'|h')| \le D(x, x') + D_{\mathrm{C}}(h, h').$$

Let $X_{\mathrm{C}}$ be the set of contexts, and $X_{\mathrm{DC}} = X \times X_{\mathrm{C}}$ be the
set of all (document, context) pairs. More abstractly, one considers the metric space
$(X_{\mathrm{DC}}, D_{\mathrm{DC}})$, henceforth the DC-space, where the metric is

$$D_{\mathrm{DC}}((x,h), (x',h')) = D(x, x') + D_{\mathrm{C}}(h, h').$$

We partition $X_{\mathrm{DC}}$ according to reward scales $r \in S$:

$$\Delta(x|h) \triangleq \max \mu(\cdot|h) - \mu(x|h), \quad x \in X,\ h \in X_{\mathrm{C}},$$
$$X_{\mathrm{DC},r} = \{(x,h) \in X_{\mathrm{DC}} : r \le \Delta(x|h) < 2r\}.$$

Then contextual regret of ContextZoom can be bounded by Equation (40) with
$N_{(\delta,r)} = N_r(X_{\mathrm{DC},r})$, where $N_r(\cdot)$ now refers to the covering numbers in
the DC-space (see Table 2).
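The DC-space metric and the partition by reward scales can be sketched for a finite instance as follows. The function names are hypothetical, `mu(x, h)` stands for the expected reward $\mu(x|h)$, and the reward scales are assumed to be $r = 2^{-i}$ for non-negative integers $i$.

```python
import math


def dc_distance(doc_dist, ctx_dist, pair_a, pair_b):
    """Distance in the DC-space: the sum of the document and context distances."""
    (x, h), (x2, h2) = pair_a, pair_b
    return doc_dist(x, x2) + ctx_dist(h, h2)


def partition_by_scale(mu, docs, contexts):
    """Group (document, context) pairs by reward scale.

    Returns a dict mapping i to the pairs whose badness Delta(x|h) satisfies
    r <= Delta(x|h) < 2r for r = 2**-i (ASSUMED form of the scales).
    Pairs with zero badness (optimal for their context) are omitted.
    """
    buckets = {}
    for h in contexts:
        best = max(mu(x, h) for x in docs)
        for x in docs:
            gap = best - mu(x, h)
            if gap > 0:
                i = -math.floor(math.log2(gap))  # 2**-i <= gap < 2 * 2**-i
                buckets.setdefault(i, []).append((x, h))
    return buckets
```

Each bucket corresponds to one set of pairs at a given reward scale, whose covering number in the DC-space could then be bounded, for instance with a greedy covering procedure.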
Further, one can define the contextual zooming dimension as

$$d_{\mathrm{DC}}(X, D, \mu) \triangleq \inf\{d \ge 0 : N_r(X_{\mathrm{DC},r}) \le c\, r^{-d} \ \ \forall r > 0\}.$$

Then one obtains Equation (39) with $d = d_{\mathrm{DC}}$. In the worst case, we could have $\mu$
such that $N_r(X_{\mathrm{DC},r}) = N_r(X_{\mathrm{DC}})$, in which case
$d_{\mathrm{DC}} \le \mathrm{CovDim}(X_{\mathrm{DC}}, D_{\mathrm{DC}})$.

The regret bounds for ContextZoom can be improved by taking into account “benign” context
arrivals: effectively, one can prune the regions of $X_{\mathrm{C}}$ that correspond to infrequent
context arrivals, see (Slivkins, 2009) for details. This improvement can be especially significant
if $\mathrm{CovDim}(X_{\mathrm{C}}, D_{\mathrm{C}}) > \mathrm{CovDim}(X, D)$.
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